···
       - const: qcom,msm-iommu-v2

   clocks:
+    minItems: 2
     items:
       - description: Clock required for IOMMU register group access
       - description: Clock required for underlying bus access
+      - description: Clock required for Translation Buffer Unit access

   clock-names:
+    minItems: 2
     items:
       - const: iface
       - const: bus
+      - const: tbu

   power-domains:
     maxItems: 1
+137
Documentation/driver-api/generic_pt.rst
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+Generic Radix Page Table
+========================
+
+.. kernel-doc:: include/linux/generic_pt/common.h
+   :doc: Generic Radix Page Table
+
+.. kernel-doc:: drivers/iommu/generic_pt/pt_defs.h
+   :doc: Generic Page Table Language
+
+Usage
+=====
+
+Generic PT is structured as a multi-compilation system. Since each format
+provides an API using a common set of names, only one format can be active
+within a compilation unit. This design avoids function pointers around the
+low-level API.
+
+Instead, the function pointers can end up at the higher-level API (i.e.
+map/unmap, etc.) and the per-format code can be directly inlined into the
+per-format compilation unit. For something like an IOMMU, each format will be
+compiled into a per-format IOMMU operations kernel module.
+
+For this to work, the .c file for each compilation unit includes both the
+format headers and the generic code for the implementation. For instance, in
+an implementation compilation unit the headers would normally be included as
+follows:
+
+generic_pt/fmt/iommu_amdv1.c::
+
+  #include <linux/generic_pt/common.h>
+  #include "defs_amdv1.h"
+  #include "../pt_defs.h"
+  #include "amdv1.h"
+  #include "../pt_common.h"
+  #include "../pt_iter.h"
+  #include "../iommu_pt.h"  /* The IOMMU implementation */
+
+iommu_pt.h includes definitions that generate the operations functions for
+map/unmap/etc. using the definitions provided by AMDv1. The resulting module
+will have exported symbols named like pt_iommu_amdv1_init().
+
+Refer to drivers/iommu/generic_pt/fmt/iommu_template.h for an example of how
+the IOMMU implementation uses multi-compilation to generate per-format ops
+struct pointers.
+
+The format code is written so that the common names arise from #defines to
+distinct format-specific names. This is intended to aid debuggability by
+avoiding symbol clashes across all the different formats.
+
+Exported symbols and other global names are mangled using a per-format string
+via the NS() helper macro.
+
+The format uses struct pt_common as the top-level struct for the table,
+and each format will have its own struct pt_xxx which embeds it to store
+format-specific information.
+
+The implementation will further wrap struct pt_common in its own top-level
+struct, such as struct pt_iommu_amdv1.
+
+Format functions at the struct pt_common level
+----------------------------------------------
+
+.. kernel-doc:: include/linux/generic_pt/common.h
+   :identifiers:
+.. kernel-doc:: drivers/iommu/generic_pt/pt_common.h
+
+Iteration Helpers
+-----------------
+
+.. kernel-doc:: drivers/iommu/generic_pt/pt_iter.h
+
+Writing a Format
+----------------
+
+It is best to start from a simple format that is similar to the target. x86_64
+is usually a good reference for something simple, and AMDv1 is something
+fairly complete.
+
+The required inline functions need to be implemented in the format header.
+These should all follow the standard pattern of::
+
+  static inline pt_oaddr_t amdv1pt_entry_oa(const struct pt_state *pts)
+  {
+          [..]
+  }
+  #define pt_entry_oa amdv1pt_entry_oa
+
+where a uniquely named per-format inline function provides the implementation
+and a define maps it to the generic name. This is intended to make debug
+symbols work better. Inline functions should always be used, as the prototypes
+in pt_common.h will cause the compiler to validate the function signatures and
+prevent errors.
+
+Review pt_fmt_defaults.h to understand some of the optional inlines.
+
+Once the format compiles, it should be run through the generic page table
+kunit test in kunit_generic_pt.h. For example::
+
+  $ tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig amdv1_fmt_test.*
+  [...]
+  [11:15:08] Testing complete. Ran 9 tests: passed: 9
+  [11:15:09] Elapsed time: 3.137s total, 0.001s configuring, 2.368s building, 0.311s running
+
+The generic tests are intended to prove out the format functions and give
+clearer failures to speed up finding problems. Once those pass, the entire
+kunit suite should be run.
+
+IOMMU Invalidation Features
+---------------------------
+
+Invalidation is how the page table algorithms synchronize with a HW cache of
+the page table memory, typically called the TLB (or IOTLB for IOMMU cases).
+
+The TLB can store present PTEs, non-present PTEs and table pointers, depending
+on its design. Every HW has its own approach to describing what has changed so
+that changed items are removed from the TLB.
+
+PT_FEAT_FLUSH_RANGE
+~~~~~~~~~~~~~~~~~~~
+
+PT_FEAT_FLUSH_RANGE is the easiest scheme to understand. It tries to generate
+a single range invalidation for each operation, over-invalidating if there are
+gaps of VA that don't need invalidation. This trades off impacted VA for the
+number of invalidation operations. It does not keep track of what is being
+invalidated; however, if pages have to be freed then page table pointers have
+to be cleaned from the walk cache. The range can start/end at any page
+boundary.
+
+PT_FEAT_FLUSH_RANGE_NO_GAPS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PT_FEAT_FLUSH_RANGE_NO_GAPS is similar to PT_FEAT_FLUSH_RANGE; however, it
+tries to minimize the amount of impacted VA by issuing extra flush operations.
+This is useful if the cost of processing VA is very high, for instance because
+a hypervisor is processing the page table with a shadowing algorithm.
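A minimal standalone sketch of the naming scheme described above (compile-once-per-format with a per-format prefix). PT_CAT here is a local stand-in for the kernel's CONCATENATE(); PTPFX and DOMAIN_NS mirror the names used by iommu_pt.h, but the program itself is only illustrative and not kernel code::

  #include <stdio.h>

  /* Two-level paste so macro arguments are expanded first */
  #define PT_CAT_(a, b) a##b
  #define PT_CAT(a, b) PT_CAT_(a, b)

  /*
   * Each per-format compilation unit defines its own prefix before
   * including the shared implementation body.  "amdv1_" is illustrative.
   */
  #define PTPFX amdv1_
  #define DOMAIN_NS(op) PT_CAT(PT_CAT(pt_iommu_, PTPFX), op)

  /* Shared body: in this unit it compiles to pt_iommu_amdv1_init() */
  int DOMAIN_NS(init)(void)
  {
          return 0;
  }

  int main(void)
  {
          printf("%d\n", pt_iommu_amdv1_init());
          return 0;
  }

Compiling the same body in a unit that defines PTPFX as x86_64_ would instead emit pt_iommu_x86_64_init(), which is how one source file yields distinct per-format modules.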
drivers/amba/Kconfig
···
 if ARM_AMBA

 config TEGRA_AHB
-	bool
+	bool "Enable AHB driver for NVIDIA Tegra SoCs" if COMPILE_TEST
 	default y if ARCH_TEGRA
 	help
 	  Adds AHB configuration functionality for NVIDIA Tegra SoCs,
+9-6
drivers/iommu/Kconfig
···
 	  sizes at both stage-1 and stage-2, as well as address spaces
 	  up to 48-bits in size.

-config IOMMU_IO_PGTABLE_LPAE_SELFTEST
-	bool "LPAE selftests"
-	depends on IOMMU_IO_PGTABLE_LPAE
+config IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST
+	tristate "KUnit tests for LPAE"
+	depends on IOMMU_IO_PGTABLE_LPAE && KUNIT
+	default KUNIT_ALL_TESTS
 	help
-	  Enable self-tests for LPAE page table allocator. This performs
-	  a series of page-table consistency checks during boot.
+	  Enable kunit tests for LPAE page table allocator. This performs
+	  a series of page-table consistency checks.

 	  If unsure, say N here.
···
 config TEGRA_IOMMU_SMMU
 	bool "NVIDIA Tegra SMMU Support"
-	depends on ARCH_TEGRA
+	depends on ARCH_TEGRA || COMPILE_TEST
 	depends on TEGRA_AHB
 	depends on TEGRA_MC
 	select IOMMU_API
···
 	  Say Y here if you want to use the multimedia devices listed above.

 endif # IOMMU_SUPPORT
+
+source "drivers/iommu/generic_pt/Kconfig"
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
···
 	cd_table->l2.l1tab = dma_alloc_coherent(smmu->dev, l1size,
 						&cd_table->cdtab_dma,
 						GFP_KERNEL);
-	if (!cd_table->l2.l2ptrs) {
+	if (!cd_table->l2.l1tab) {
 		ret = -ENOMEM;
 		goto err_free_l2ptrs;
 	}
···
 	master->ats_enabled = state->ats_enabled;
 }

-static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
+static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev,
+			       struct iommu_domain *old_domain)
 {
 	int ret = 0;
 	struct arm_smmu_ste target;
···
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_attach_state state = {
-		.old_domain = iommu_get_domain_for_dev(dev),
+		.old_domain = old_domain,
 		.ssid = IOMMU_NO_PASID,
 	};
 	struct arm_smmu_master *master;
···

 	/*
 	 * When the last user of the CD table goes away downgrade the STE back
-	 * to a non-cd_table one.
+	 * to a non-cd_table one, by re-attaching its sid_domain.
 	 */
 	if (!arm_smmu_ssids_in_use(&master->cd_table)) {
 		struct iommu_domain *sid_domain =
···

 		if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
 		    sid_domain->type == IOMMU_DOMAIN_BLOCKED)
-			sid_domain->ops->attach_dev(sid_domain, dev);
+			sid_domain->ops->attach_dev(sid_domain, dev,
+						    sid_domain);
 	}
 	return 0;
 }

 static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
+				    struct iommu_domain *old_domain,
 				    struct device *dev,
 				    struct arm_smmu_ste *ste,
 				    unsigned int s1dss)
···
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
 	struct arm_smmu_attach_state state = {
 		.master = master,
-		.old_domain = iommu_get_domain_for_dev(dev),
+		.old_domain = old_domain,
 		.ssid = IOMMU_NO_PASID,
 	};

···
 }

 static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
-					struct device *dev)
+					struct device *dev,
+					struct iommu_domain *old_domain)
 {
 	struct arm_smmu_ste ste;
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);

 	arm_smmu_master_clear_vmaster(master);
 	arm_smmu_make_bypass_ste(master->smmu, &ste);
-	arm_smmu_attach_dev_ste(domain, dev, &ste, STRTAB_STE_1_S1DSS_BYPASS);
+	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
+				STRTAB_STE_1_S1DSS_BYPASS);
 	return 0;
 }
···
 };

 static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
-				       struct device *dev)
+				       struct device *dev,
+				       struct iommu_domain *old_domain)
 {
 	struct arm_smmu_ste ste;
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);

 	arm_smmu_master_clear_vmaster(master);
 	arm_smmu_make_abort_ste(&ste);
-	arm_smmu_attach_dev_ste(domain, dev, &ste,
+	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
 				STRTAB_STE_1_S1DSS_TERMINATE);
 	return 0;
 }
···

 	WARN_ON(master->iopf_refcount);

-	/* Put the STE back to what arm_smmu_init_strtab() sets */
-	if (dev->iommu->require_direct)
-		arm_smmu_attach_dev_identity(&arm_smmu_identity_domain, dev);
-	else
-		arm_smmu_attach_dev_blocked(&arm_smmu_blocked_domain, dev);
-
 	arm_smmu_disable_pasid(master);
 	arm_smmu_remove_master(master);
 	if (arm_smmu_cdtab_allocated(&master->cd_table))
···
 static const struct iommu_ops arm_smmu_ops = {
 	.identity_domain = &arm_smmu_identity_domain,
 	.blocked_domain = &arm_smmu_blocked_domain,
+	.release_domain = &arm_smmu_blocked_domain,
 	.capable = arm_smmu_capable,
 	.hw_info = arm_smmu_hw_info,
 	.domain_alloc_sva = arm_smmu_sva_domain_alloc,
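The signature change above threads the previously attached domain into attach_dev from the core instead of the driver looking it up with iommu_get_domain_for_dev(). A compile-only sketch of the new calling convention, with all types stubbed and example_* names that are purely illustrative, not the driver's:

	/* Opaque stand-ins; the real types come from <linux/iommu.h> */
	struct device;
	struct iommu_domain;

	struct example_attach_state {
		struct iommu_domain *old_domain;
	};

	int example_attach_dev(struct iommu_domain *domain, struct device *dev,
			       struct iommu_domain *old_domain)
	{
		struct example_attach_state state = {
			/* before: .old_domain = iommu_get_domain_for_dev(dev) */
			.old_domain = old_domain,
		};

		/* ... program the HW for "domain", tearing down state.old_domain ... */
		(void)domain;
		(void)dev;
		(void)state;
		return 0;
	}

With release_domain set to the blocked domain, the core can also perform the final teardown attachment itself, which is why the manual "put the STE back" block in the release path is removed.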
+18-10
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
···
 static const struct of_device_id qcom_smmu_client_of_match[] __maybe_unused = {
 	{ .compatible = "qcom,adreno" },
 	{ .compatible = "qcom,adreno-gmu" },
+	{ .compatible = "qcom,glymur-mdss" },
 	{ .compatible = "qcom,mdp4" },
 	{ .compatible = "qcom,mdss" },
 	{ .compatible = "qcom,qcm2290-mdss" },
···

 	/*
 	 * Some platforms support more than the Arm SMMU architected maximum of
-	 * 128 stream matching groups. For unknown reasons, the additional
-	 * groups don't exhibit the same behavior as the architected registers,
-	 * so limit the groups to 128 until the behavior is fixed for the other
-	 * groups.
+	 * 128 stream matching groups. The additional registers appear to have
+	 * the same behavior as the architected registers in the hardware.
+	 * However, on some firmware versions, the hypervisor does not
+	 * correctly trap and emulate accesses to the additional registers,
+	 * resulting in unexpected behavior.
+	 *
+	 * If there are more than 128 groups, use the last reliable group to
+	 * detect if we need to apply the bypass quirk.
 	 */
-	if (smmu->num_mapping_groups > 128) {
-		dev_notice(smmu->dev, "\tLimiting the stream matching groups to 128\n");
-		smmu->num_mapping_groups = 128;
-	}
-
-	last_s2cr = ARM_SMMU_GR0_S2CR(smmu->num_mapping_groups - 1);
+	if (smmu->num_mapping_groups > 128)
+		last_s2cr = ARM_SMMU_GR0_S2CR(127);
+	else
+		last_s2cr = ARM_SMMU_GR0_S2CR(smmu->num_mapping_groups - 1);

 	/*
 	 * With some firmware versions writes to S2CR of type FAULT are
···

 		reg = FIELD_PREP(ARM_SMMU_CBAR_TYPE, CBAR_TYPE_S1_TRANS_S2_BYPASS);
 		arm_smmu_gr1_write(smmu, ARM_SMMU_GR1_CBAR(qsmmu->bypass_cbndx), reg);
+
+		if (smmu->num_mapping_groups > 128) {
+			dev_notice(smmu->dev, "\tLimiting the stream matching groups to 128\n");
+			smmu->num_mapping_groups = 128;
+		}
 	}

 	for (i = 0; i < smmu->num_mapping_groups; i++) {
drivers/iommu/generic_pt/Kconfig
+# SPDX-License-Identifier: GPL-2.0-only
+
+menuconfig GENERIC_PT
+	bool "Generic Radix Page Table" if COMPILE_TEST
+	help
+	  Generic library for building radix tree page tables.
+
+	  Generic PT provides a set of HW page table formats and a common
+	  set of APIs to work with them.
+
+if GENERIC_PT
+config DEBUG_GENERIC_PT
+	bool "Extra debugging checks for GENERIC_PT"
+	help
+	  Enable extra run time debugging checks for GENERIC_PT code. This
+	  incurs a runtime cost and should not be enabled for production
+	  kernels.
+
+	  The kunit tests require this to be enabled to get full coverage.
+
+config IOMMU_PT
+	tristate "IOMMU Page Tables"
+	select IOMMU_API
+	depends on IOMMU_SUPPORT
+	depends on GENERIC_PT
+	help
+	  Generic library for building IOMMU page tables
+
+	  IOMMU_PT provides an implementation of the page table operations
+	  related to struct iommu_domain using GENERIC_PT. It provides a single
+	  implementation of the page table operations that can be shared by
+	  multiple drivers.
+
+if IOMMU_PT
+config IOMMU_PT_AMDV1
+	tristate "IOMMU page table for 64-bit AMD IOMMU v1"
+	depends on !GENERIC_ATOMIC64	# for cmpxchg64
+	help
+	  iommu_domain implementation for the AMD v1 page table. AMDv1 is the
+	  "host" page table. It supports granular page sizes of almost every
+	  power of 2 and decodes the full 64-bit IOVA space.
+
+	  Selected automatically by an IOMMU driver that uses this format.
+
+config IOMMU_PT_VTDSS
+	tristate "IOMMU page table for Intel VT-d Second Stage"
+	depends on !GENERIC_ATOMIC64	# for cmpxchg64
+	help
+	  iommu_domain implementation for the Intel VT-d's 64 bit 3/4/5
+	  level Second Stage page table. It is similar to the X86_64 format with
+	  4K/2M/1G page sizes.
+
+	  Selected automatically by an IOMMU driver that uses this format.
+
+config IOMMU_PT_X86_64
+	tristate "IOMMU page table for x86 64-bit, 4/5 levels"
+	depends on !GENERIC_ATOMIC64	# for cmpxchg64
+	help
+	  iommu_domain implementation for the x86 64-bit 4/5 level page table.
+	  It supports 4K/2M/1G page sizes and can decode a sign-extended
+	  portion of the 64-bit IOVA space.
+
+	  Selected automatically by an IOMMU driver that uses this format.
+
+config IOMMU_PT_KUNIT_TEST
+	tristate "IOMMU Page Table KUnit Test" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	depends on IOMMU_PT_AMDV1 || !IOMMU_PT_AMDV1
+	depends on IOMMU_PT_X86_64 || !IOMMU_PT_X86_64
+	depends on IOMMU_PT_VTDSS || !IOMMU_PT_VTDSS
+	default KUNIT_ALL_TESTS
+	help
+	  Enable kunit tests for GENERIC_PT and IOMMU_PT that covers all the
+	  enabled page table formats. The test covers most of the GENERIC_PT
+	  functions provided by the page table format, as well as covering the
+	  iommu_domain related functions.
+
+endif
+endif
+28
drivers/iommu/generic_pt/fmt/Makefile
+# SPDX-License-Identifier: GPL-2.0
+
+iommu_pt_fmt-$(CONFIG_IOMMU_PT_AMDV1) += amdv1
+iommu_pt_fmt-$(CONFIG_IOMMUFD_TEST) += mock
+
+iommu_pt_fmt-$(CONFIG_IOMMU_PT_VTDSS) += vtdss
+
+iommu_pt_fmt-$(CONFIG_IOMMU_PT_X86_64) += x86_64
+
+IOMMU_PT_KUNIT_TEST :=
+define create_format
+obj-$(2) += iommu_$(1).o
+iommu_pt_kunit_test-y += kunit_iommu_$(1).o
+CFLAGS_kunit_iommu_$(1).o += -DGENERIC_PT_KUNIT=1
+IOMMU_PT_KUNIT_TEST := iommu_pt_kunit_test.o
+
+endef
+
+$(eval $(foreach fmt,$(iommu_pt_fmt-y),$(call create_format,$(fmt),y)))
+$(eval $(foreach fmt,$(iommu_pt_fmt-m),$(call create_format,$(fmt),m)))
+
+# The kunit objects are constructed by compiling the main source
+# with -DGENERIC_PT_KUNIT
+$(obj)/kunit_iommu_%.o: $(src)/iommu_%.c FORCE
+	$(call rule_mkdir)
+	$(call if_changed_dep,cc_o_c)
+
+obj-$(CONFIG_IOMMU_PT_KUNIT_TEST) += $(IOMMU_PT_KUNIT_TEST)
+411
drivers/iommu/generic_pt/fmt/amdv1.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * AMD IOMMU v1 page table66+ *77+ * This is described in Section "2.2.3 I/O Page Tables for Host Translations"88+ * of the "AMD I/O Virtualization Technology (IOMMU) Specification"99+ *1010+ * Note the level numbering here matches the core code, so level 0 is the same1111+ * as mode 1.1212+ *1313+ */1414+#ifndef __GENERIC_PT_FMT_AMDV1_H1515+#define __GENERIC_PT_FMT_AMDV1_H1616+1717+#include "defs_amdv1.h"1818+#include "../pt_defs.h"1919+2020+#include <asm/page.h>2121+#include <linux/bitfield.h>2222+#include <linux/container_of.h>2323+#include <linux/mem_encrypt.h>2424+#include <linux/minmax.h>2525+#include <linux/sizes.h>2626+#include <linux/string.h>2727+2828+enum {2929+ PT_ITEM_WORD_SIZE = sizeof(u64),3030+ /*3131+ * The IOMMUFD selftest uses the AMDv1 format with some alterations It3232+ * uses a 2k page size to test cases where the CPU page size is not the3333+ * same.3434+ */3535+#ifdef AMDV1_IOMMUFD_SELFTEST3636+ PT_MAX_VA_ADDRESS_LG2 = 56,3737+ PT_MAX_OUTPUT_ADDRESS_LG2 = 51,3838+ PT_MAX_TOP_LEVEL = 4,3939+ PT_GRANULE_LG2SZ = 11,4040+#else4141+ PT_MAX_VA_ADDRESS_LG2 = 64,4242+ PT_MAX_OUTPUT_ADDRESS_LG2 = 52,4343+ PT_MAX_TOP_LEVEL = 5,4444+ PT_GRANULE_LG2SZ = 12,4545+#endif4646+ PT_TABLEMEM_LG2SZ = 12,4747+4848+ /* The DTE only has these bits for the top phyiscal address */4949+ PT_TOP_PHYS_MASK = GENMASK_ULL(51, 12),5050+};5151+5252+/* PTE bits */5353+enum {5454+ AMDV1PT_FMT_PR = BIT(0),5555+ AMDV1PT_FMT_D = BIT(6),5656+ AMDV1PT_FMT_NEXT_LEVEL = GENMASK_ULL(11, 9),5757+ AMDV1PT_FMT_OA = GENMASK_ULL(51, 12),5858+ AMDV1PT_FMT_FC = BIT_ULL(60),5959+ AMDV1PT_FMT_IR = BIT_ULL(61),6060+ AMDV1PT_FMT_IW = BIT_ULL(62),6161+};6262+6363+/*6464+ * gcc 13 has a bug where it thinks the output of FIELD_GET() is an enum, make6565+ * these defines to avoid it.6666+ */6767+#define AMDV1PT_FMT_NL_DEFAULT 06868+#define AMDV1PT_FMT_NL_SIZE 76969+7070+static inline pt_oaddr_t amdv1pt_table_pa(const struct pt_state *pts)7171+{7272+ u64 entry = pts->entry;7373+7474+ if (pts_feature(pts, PT_FEAT_AMDV1_ENCRYPT_TABLES))7575+ entry = __sme_clr(entry);7676+ return oalog2_mul(FIELD_GET(AMDV1PT_FMT_OA, entry), PT_GRANULE_LG2SZ);7777+}7878+#define pt_table_pa amdv1pt_table_pa7979+8080+/* Returns the oa for the start of the contiguous entry */8181+static inline pt_oaddr_t amdv1pt_entry_oa(const struct pt_state *pts)8282+{8383+ u64 entry = pts->entry;8484+ pt_oaddr_t oa;8585+8686+ if (pts_feature(pts, PT_FEAT_AMDV1_ENCRYPT_TABLES))8787+ entry = __sme_clr(entry);8888+ oa = FIELD_GET(AMDV1PT_FMT_OA, entry);8989+9090+ if (FIELD_GET(AMDV1PT_FMT_NEXT_LEVEL, entry) == AMDV1PT_FMT_NL_SIZE) {9191+ unsigned int sz_bits = oaffz(oa);9292+9393+ oa = oalog2_set_mod(oa, 0, sz_bits);9494+ } else if (PT_WARN_ON(FIELD_GET(AMDV1PT_FMT_NEXT_LEVEL, entry) !=9595+ AMDV1PT_FMT_NL_DEFAULT))9696+ return 0;9797+ return oalog2_mul(oa, PT_GRANULE_LG2SZ);9898+}9999+#define pt_entry_oa amdv1pt_entry_oa100100+101101+static inline bool amdv1pt_can_have_leaf(const struct pt_state *pts)102102+{103103+ /*104104+ * Table 15: Page Table Level Parameters105105+ * The top most level cannot have translation entries106106+ */107107+ return pts->level < PT_MAX_TOP_LEVEL;108108+}109109+#define pt_can_have_leaf amdv1pt_can_have_leaf110110+111111+/* Body in pt_fmt_defaults.h */112112+static inline unsigned int pt_table_item_lg2sz(const struct pt_state *pts);113113+114114+static inline unsigned 
int115115+amdv1pt_entry_num_contig_lg2(const struct pt_state *pts)116116+{117117+ u32 code;118118+119119+ if (FIELD_GET(AMDV1PT_FMT_NEXT_LEVEL, pts->entry) ==120120+ AMDV1PT_FMT_NL_DEFAULT)121121+ return ilog2(1);122122+123123+ PT_WARN_ON(FIELD_GET(AMDV1PT_FMT_NEXT_LEVEL, pts->entry) !=124124+ AMDV1PT_FMT_NL_SIZE);125125+126126+ /*127127+ * The contiguous size is encoded in the length of a string of 1's in128128+ * the low bits of the OA. Reverse the equation:129129+ * code = log2_to_int(num_contig_lg2 + item_lg2sz -130130+ * PT_GRANULE_LG2SZ - 1) - 1131131+ * Which can be expressed as:132132+ * num_contig_lg2 = oalog2_ffz(code) + 1 -133133+ * item_lg2sz - PT_GRANULE_LG2SZ134134+ *135135+ * Assume the bit layout is correct and remove the masking. Reorganize136136+ * the equation to move all the arithmetic before the ffz.137137+ */138138+ code = pts->entry >> (__bf_shf(AMDV1PT_FMT_OA) - 1 +139139+ pt_table_item_lg2sz(pts) - PT_GRANULE_LG2SZ);140140+ return ffz_t(u32, code);141141+}142142+#define pt_entry_num_contig_lg2 amdv1pt_entry_num_contig_lg2143143+144144+static inline unsigned int amdv1pt_num_items_lg2(const struct pt_state *pts)145145+{146146+ /*147147+ * Top entry covers bits [63:57] only, this is handled through148148+ * max_vasz_lg2.149149+ */150150+ if (PT_WARN_ON(pts->level == 5))151151+ return 7;152152+ return PT_TABLEMEM_LG2SZ - ilog2(sizeof(u64));153153+}154154+#define pt_num_items_lg2 amdv1pt_num_items_lg2155155+156156+static inline pt_vaddr_t amdv1pt_possible_sizes(const struct pt_state *pts)157157+{158158+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);159159+160160+ if (!amdv1pt_can_have_leaf(pts))161161+ return 0;162162+163163+ /*164164+ * Table 14: Example Page Size Encodings165165+ * Address bits 51:32 can be used to encode page sizes greater than 4166166+ * Gbytes. 
Address bits 63:52 are zero-extended.167167+ *168168+ * 512GB Pages are not supported due to a hardware bug.169169+ * Otherwise every power of two size is supported.170170+ */171171+ return GENMASK_ULL(min(51, isz_lg2 + amdv1pt_num_items_lg2(pts) - 1),172172+ isz_lg2) & ~SZ_512G;173173+}174174+#define pt_possible_sizes amdv1pt_possible_sizes175175+176176+static inline enum pt_entry_type amdv1pt_load_entry_raw(struct pt_state *pts)177177+{178178+ const u64 *tablep = pt_cur_table(pts, u64) + pts->index;179179+ unsigned int next_level;180180+ u64 entry;181181+182182+ pts->entry = entry = READ_ONCE(*tablep);183183+ if (!(entry & AMDV1PT_FMT_PR))184184+ return PT_ENTRY_EMPTY;185185+186186+ next_level = FIELD_GET(AMDV1PT_FMT_NEXT_LEVEL, pts->entry);187187+ if (pts->level == 0 || next_level == AMDV1PT_FMT_NL_DEFAULT ||188188+ next_level == AMDV1PT_FMT_NL_SIZE)189189+ return PT_ENTRY_OA;190190+ return PT_ENTRY_TABLE;191191+}192192+#define pt_load_entry_raw amdv1pt_load_entry_raw193193+194194+static inline void195195+amdv1pt_install_leaf_entry(struct pt_state *pts, pt_oaddr_t oa,196196+ unsigned int oasz_lg2,197197+ const struct pt_write_attrs *attrs)198198+{199199+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);200200+ u64 *tablep = pt_cur_table(pts, u64) + pts->index;201201+ u64 entry;202202+203203+ if (!pt_check_install_leaf_args(pts, oa, oasz_lg2))204204+ return;205205+206206+ entry = AMDV1PT_FMT_PR |207207+ FIELD_PREP(AMDV1PT_FMT_OA, log2_div(oa, PT_GRANULE_LG2SZ)) |208208+ attrs->descriptor_bits;209209+210210+ if (oasz_lg2 == isz_lg2) {211211+ entry |= FIELD_PREP(AMDV1PT_FMT_NEXT_LEVEL,212212+ AMDV1PT_FMT_NL_DEFAULT);213213+ WRITE_ONCE(*tablep, entry);214214+ } else {215215+ unsigned int num_contig_lg2 = oasz_lg2 - isz_lg2;216216+ u64 *end = tablep + log2_to_int(num_contig_lg2);217217+218218+ entry |= FIELD_PREP(AMDV1PT_FMT_NEXT_LEVEL,219219+ AMDV1PT_FMT_NL_SIZE) |220220+ FIELD_PREP(AMDV1PT_FMT_OA,221221+ oalog2_to_int(oasz_lg2 - PT_GRANULE_LG2SZ -222222+ 1) -223223+ 1);224224+225225+ /* See amdv1pt_clear_entries() */226226+ if (num_contig_lg2 <= ilog2(32)) {227227+ for (; tablep != end; tablep++)228228+ WRITE_ONCE(*tablep, entry);229229+ } else {230230+ memset64(tablep, entry, log2_to_int(num_contig_lg2));231231+ }232232+ }233233+ pts->entry = entry;234234+}235235+#define pt_install_leaf_entry amdv1pt_install_leaf_entry236236+237237+static inline bool amdv1pt_install_table(struct pt_state *pts,238238+ pt_oaddr_t table_pa,239239+ const struct pt_write_attrs *attrs)240240+{241241+ u64 entry;242242+243243+ /*244244+ * IR and IW are ANDed from the table levels along with the PTE. 
We245245+ * always control permissions from the PTE, so always set IR and IW for246246+ * tables.247247+ */248248+ entry = AMDV1PT_FMT_PR |249249+ FIELD_PREP(AMDV1PT_FMT_NEXT_LEVEL, pts->level) |250250+ FIELD_PREP(AMDV1PT_FMT_OA,251251+ log2_div(table_pa, PT_GRANULE_LG2SZ)) |252252+ AMDV1PT_FMT_IR | AMDV1PT_FMT_IW;253253+ if (pts_feature(pts, PT_FEAT_AMDV1_ENCRYPT_TABLES))254254+ entry = __sme_set(entry);255255+ return pt_table_install64(pts, entry);256256+}257257+#define pt_install_table amdv1pt_install_table258258+259259+static inline void amdv1pt_attr_from_entry(const struct pt_state *pts,260260+ struct pt_write_attrs *attrs)261261+{262262+ attrs->descriptor_bits =263263+ pts->entry & (AMDV1PT_FMT_FC | AMDV1PT_FMT_IR | AMDV1PT_FMT_IW);264264+}265265+#define pt_attr_from_entry amdv1pt_attr_from_entry266266+267267+static inline void amdv1pt_clear_entries(struct pt_state *pts,268268+ unsigned int num_contig_lg2)269269+{270270+ u64 *tablep = pt_cur_table(pts, u64) + pts->index;271271+ u64 *end = tablep + log2_to_int(num_contig_lg2);272272+273273+ /*274274+ * gcc generates rep stos for the io-pgtable code, and this difference275275+ * can show in microbenchmarks with larger contiguous page sizes.276276+ * rep is slower for small cases.277277+ */278278+ if (num_contig_lg2 <= ilog2(32)) {279279+ for (; tablep != end; tablep++)280280+ WRITE_ONCE(*tablep, 0);281281+ } else {282282+ memset64(tablep, 0, log2_to_int(num_contig_lg2));283283+ }284284+}285285+#define pt_clear_entries amdv1pt_clear_entries286286+287287+static inline bool amdv1pt_entry_is_write_dirty(const struct pt_state *pts)288288+{289289+ unsigned int num_contig_lg2 = amdv1pt_entry_num_contig_lg2(pts);290290+ u64 *tablep = pt_cur_table(pts, u64) +291291+ log2_set_mod(pts->index, 0, num_contig_lg2);292292+ u64 *end = tablep + log2_to_int(num_contig_lg2);293293+294294+ for (; tablep != end; tablep++)295295+ if (READ_ONCE(*tablep) & AMDV1PT_FMT_D)296296+ return true;297297+ return false;298298+}299299+#define pt_entry_is_write_dirty amdv1pt_entry_is_write_dirty300300+301301+static inline void amdv1pt_entry_make_write_clean(struct pt_state *pts)302302+{303303+ unsigned int num_contig_lg2 = amdv1pt_entry_num_contig_lg2(pts);304304+ u64 *tablep = pt_cur_table(pts, u64) +305305+ log2_set_mod(pts->index, 0, num_contig_lg2);306306+ u64 *end = tablep + log2_to_int(num_contig_lg2);307307+308308+ for (; tablep != end; tablep++)309309+ WRITE_ONCE(*tablep, READ_ONCE(*tablep) & ~(u64)AMDV1PT_FMT_D);310310+}311311+#define pt_entry_make_write_clean amdv1pt_entry_make_write_clean312312+313313+static inline bool amdv1pt_entry_make_write_dirty(struct pt_state *pts)314314+{315315+ u64 *tablep = pt_cur_table(pts, u64) + pts->index;316316+ u64 new = pts->entry | AMDV1PT_FMT_D;317317+318318+ return try_cmpxchg64(tablep, &pts->entry, new);319319+}320320+#define pt_entry_make_write_dirty amdv1pt_entry_make_write_dirty321321+322322+/* --- iommu */323323+#include <linux/generic_pt/iommu.h>324324+#include <linux/iommu.h>325325+326326+#define pt_iommu_table pt_iommu_amdv1327327+328328+/* The common struct is in the per-format common struct */329329+static inline struct pt_common *common_from_iommu(struct pt_iommu *iommu_table)330330+{331331+ return &container_of(iommu_table, struct pt_iommu_amdv1, iommu)332332+ ->amdpt.common;333333+}334334+335335+static inline struct pt_iommu *iommu_from_common(struct pt_common *common)336336+{337337+ return &container_of(common, struct pt_iommu_amdv1, amdpt.common)->iommu;338338+}339339+340340+static inline int 
amdv1pt_iommu_set_prot(struct pt_common *common,341341+ struct pt_write_attrs *attrs,342342+ unsigned int iommu_prot)343343+{344344+ u64 pte = 0;345345+346346+ if (pt_feature(common, PT_FEAT_AMDV1_FORCE_COHERENCE))347347+ pte |= AMDV1PT_FMT_FC;348348+ if (iommu_prot & IOMMU_READ)349349+ pte |= AMDV1PT_FMT_IR;350350+ if (iommu_prot & IOMMU_WRITE)351351+ pte |= AMDV1PT_FMT_IW;352352+353353+ /*354354+ * Ideally we'd have an IOMMU_ENCRYPTED flag set by higher levels to355355+ * control this. For now if the tables use sme_set then so do the ptes.356356+ */357357+ if (pt_feature(common, PT_FEAT_AMDV1_ENCRYPT_TABLES))358358+ pte = __sme_set(pte);359359+360360+ attrs->descriptor_bits = pte;361361+ return 0;362362+}363363+#define pt_iommu_set_prot amdv1pt_iommu_set_prot364364+365365+static inline int amdv1pt_iommu_fmt_init(struct pt_iommu_amdv1 *iommu_table,366366+ const struct pt_iommu_amdv1_cfg *cfg)367367+{368368+ struct pt_amdv1 *table = &iommu_table->amdpt;369369+ unsigned int max_vasz_lg2 = PT_MAX_VA_ADDRESS_LG2;370370+371371+ if (cfg->starting_level == 0 || cfg->starting_level > PT_MAX_TOP_LEVEL)372372+ return -EINVAL;373373+374374+ if (!pt_feature(&table->common, PT_FEAT_DYNAMIC_TOP) &&375375+ cfg->starting_level != PT_MAX_TOP_LEVEL)376376+ max_vasz_lg2 = PT_GRANULE_LG2SZ +377377+ (PT_TABLEMEM_LG2SZ - ilog2(sizeof(u64))) *378378+ (cfg->starting_level + 1);379379+380380+ table->common.max_vasz_lg2 =381381+ min(max_vasz_lg2, cfg->common.hw_max_vasz_lg2);382382+ table->common.max_oasz_lg2 =383383+ min(PT_MAX_OUTPUT_ADDRESS_LG2, cfg->common.hw_max_oasz_lg2);384384+ pt_top_set_level(&table->common, cfg->starting_level);385385+ return 0;386386+}387387+#define pt_iommu_fmt_init amdv1pt_iommu_fmt_init388388+389389+#ifndef PT_FMT_VARIANT390390+static inline void391391+amdv1pt_iommu_fmt_hw_info(struct pt_iommu_amdv1 *table,392392+ const struct pt_range *top_range,393393+ struct pt_iommu_amdv1_hw_info *info)394394+{395395+ info->host_pt_root = virt_to_phys(top_range->top_table);396396+ PT_WARN_ON(info->host_pt_root & ~PT_TOP_PHYS_MASK);397397+ info->mode = top_range->top_level + 1;398398+}399399+#define pt_iommu_fmt_hw_info amdv1pt_iommu_fmt_hw_info400400+#endif401401+402402+#if defined(GENERIC_PT_KUNIT)403403+static const struct pt_iommu_amdv1_cfg amdv1_kunit_fmt_cfgs[] = {404404+ /* Matches what io_pgtable does */405405+ [0] = { .starting_level = 2 },406406+};407407+#define kunit_fmt_cfgs amdv1_kunit_fmt_cfgs408408+enum { KUNIT_FMT_FEATURES = 0 };409409+#endif410410+411411+#endif
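The NL_SIZE encoding used above (a run of 1s in the low OA bits plus the entry replicated across adjacent slots) is easiest to see with concrete numbers. A small standalone sketch of just the arithmetic, using local helper names rather than the kernel's log2/oalog2 helpers; for example a 32KiB mapping at level 0 encodes the size code 0b011 and is written to 8 consecutive 4KiB entries:

	#include <stdio.h>

	/*
	 * Illustrative only: reproduces the arithmetic of the AMDv1
	 * "Next Level = 7" contiguous-size encoding outside the kernel.
	 */
	static unsigned int encode_size_code(unsigned int oasz_lg2,
					     unsigned int granule_lg2)
	{
		/* matches: log2_to_int(oasz_lg2 - PT_GRANULE_LG2SZ - 1) - 1 */
		return (1u << (oasz_lg2 - granule_lg2 - 1)) - 1;
	}

	static unsigned int first_zero_bit(unsigned int v)
	{
		unsigned int i = 0;

		while (v & 1) {
			v >>= 1;
			i++;
		}
		return i;
	}

	int main(void)
	{
		unsigned int granule_lg2 = 12;	/* 4K granule */
		unsigned int item_lg2 = 12;	/* level 0 items are 4K */
		unsigned int oasz_lg2 = 15;	/* map a 32K contiguous page */
		unsigned int code = encode_size_code(oasz_lg2, granule_lg2);
		unsigned int num_contig_lg2 =
			first_zero_bit(code) + 1 - (item_lg2 - granule_lg2);

		/* 32K -> code 0x3 (two 1s), replicated over 8 level-0 entries */
		printf("code=0x%x entries=%u\n", code, 1u << num_contig_lg2);
		return 0;
	}

Decoding inverts the encode step: the position of the first zero bit in the low OA bits gives the contiguous size, which is why amdv1pt_entry_num_contig_lg2() reduces to a single ffz after shifting.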
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES44+ *55+ * Intel VT-d Second Stange 5/4 level page table66+ *77+ * This is described in88+ * Section "3.7 Second-Stage Translation"99+ * Section "9.8 Second-Stage Paging Entries"1010+ *1111+ * Of the "Intel Virtualization Technology for Directed I/O Architecture1212+ * Specification".1313+ *1414+ * The named levels in the spec map to the pts->level as:1515+ * Table/SS-PTE - 01616+ * Directory/SS-PDE - 11717+ * Directory Ptr/SS-PDPTE - 21818+ * PML4/SS-PML4E - 31919+ * PML5/SS-PML5E - 42020+ */2121+#ifndef __GENERIC_PT_FMT_VTDSS_H2222+#define __GENERIC_PT_FMT_VTDSS_H2323+2424+#include "defs_vtdss.h"2525+#include "../pt_defs.h"2626+2727+#include <linux/bitfield.h>2828+#include <linux/container_of.h>2929+#include <linux/log2.h>3030+3131+enum {3232+ PT_MAX_OUTPUT_ADDRESS_LG2 = 52,3333+ PT_MAX_VA_ADDRESS_LG2 = 57,3434+ PT_ITEM_WORD_SIZE = sizeof(u64),3535+ PT_MAX_TOP_LEVEL = 4,3636+ PT_GRANULE_LG2SZ = 12,3737+ PT_TABLEMEM_LG2SZ = 12,3838+3939+ /* SSPTPTR is 4k aligned and limited by HAW */4040+ PT_TOP_PHYS_MASK = GENMASK_ULL(63, 12),4141+};4242+4343+/* Shared descriptor bits */4444+enum {4545+ VTDSS_FMT_R = BIT(0),4646+ VTDSS_FMT_W = BIT(1),4747+ VTDSS_FMT_A = BIT(8),4848+ VTDSS_FMT_D = BIT(9),4949+ VTDSS_FMT_SNP = BIT(11),5050+ VTDSS_FMT_OA = GENMASK_ULL(51, 12),5151+};5252+5353+/* PDPTE/PDE */5454+enum {5555+ VTDSS_FMT_PS = BIT(7),5656+};5757+5858+#define common_to_vtdss_pt(common_ptr) \5959+ container_of_const(common_ptr, struct pt_vtdss, common)6060+#define to_vtdss_pt(pts) common_to_vtdss_pt((pts)->range->common)6161+6262+static inline pt_oaddr_t vtdss_pt_table_pa(const struct pt_state *pts)6363+{6464+ return oalog2_mul(FIELD_GET(VTDSS_FMT_OA, pts->entry),6565+ PT_TABLEMEM_LG2SZ);6666+}6767+#define pt_table_pa vtdss_pt_table_pa6868+6969+static inline pt_oaddr_t vtdss_pt_entry_oa(const struct pt_state *pts)7070+{7171+ return oalog2_mul(FIELD_GET(VTDSS_FMT_OA, pts->entry),7272+ PT_GRANULE_LG2SZ);7373+}7474+#define pt_entry_oa vtdss_pt_entry_oa7575+7676+static inline bool vtdss_pt_can_have_leaf(const struct pt_state *pts)7777+{7878+ return pts->level <= 2;7979+}8080+#define pt_can_have_leaf vtdss_pt_can_have_leaf8181+8282+static inline unsigned int vtdss_pt_num_items_lg2(const struct pt_state *pts)8383+{8484+ return PT_TABLEMEM_LG2SZ - ilog2(sizeof(u64));8585+}8686+#define pt_num_items_lg2 vtdss_pt_num_items_lg28787+8888+static inline enum pt_entry_type vtdss_pt_load_entry_raw(struct pt_state *pts)8989+{9090+ const u64 *tablep = pt_cur_table(pts, u64);9191+ u64 entry;9292+9393+ pts->entry = entry = READ_ONCE(tablep[pts->index]);9494+ if (!entry)9595+ return PT_ENTRY_EMPTY;9696+ if (pts->level == 0 ||9797+ (vtdss_pt_can_have_leaf(pts) && (pts->entry & VTDSS_FMT_PS)))9898+ return PT_ENTRY_OA;9999+ return PT_ENTRY_TABLE;100100+}101101+#define pt_load_entry_raw vtdss_pt_load_entry_raw102102+103103+static inline void104104+vtdss_pt_install_leaf_entry(struct pt_state *pts, pt_oaddr_t oa,105105+ unsigned int oasz_lg2,106106+ const struct pt_write_attrs *attrs)107107+{108108+ u64 *tablep = pt_cur_table(pts, u64);109109+ u64 entry;110110+111111+ if (!pt_check_install_leaf_args(pts, oa, oasz_lg2))112112+ return;113113+114114+ entry = FIELD_PREP(VTDSS_FMT_OA, log2_div(oa, PT_GRANULE_LG2SZ)) |115115+ attrs->descriptor_bits;116116+ if (pts->level != 0)117117+ entry |= VTDSS_FMT_PS;118118+119119+ WRITE_ONCE(tablep[pts->index], entry);120120+ pts->entry = entry;121121+}122122+#define 
pt_install_leaf_entry vtdss_pt_install_leaf_entry123123+124124+static inline bool vtdss_pt_install_table(struct pt_state *pts,125125+ pt_oaddr_t table_pa,126126+ const struct pt_write_attrs *attrs)127127+{128128+ u64 entry;129129+130130+ entry = VTDSS_FMT_R | VTDSS_FMT_W |131131+ FIELD_PREP(VTDSS_FMT_OA, log2_div(table_pa, PT_GRANULE_LG2SZ));132132+ return pt_table_install64(pts, entry);133133+}134134+#define pt_install_table vtdss_pt_install_table135135+136136+static inline void vtdss_pt_attr_from_entry(const struct pt_state *pts,137137+ struct pt_write_attrs *attrs)138138+{139139+ attrs->descriptor_bits = pts->entry &140140+ (VTDSS_FMT_R | VTDSS_FMT_W | VTDSS_FMT_SNP);141141+}142142+#define pt_attr_from_entry vtdss_pt_attr_from_entry143143+144144+static inline bool vtdss_pt_entry_is_write_dirty(const struct pt_state *pts)145145+{146146+ u64 *tablep = pt_cur_table(pts, u64) + pts->index;147147+148148+ return READ_ONCE(*tablep) & VTDSS_FMT_D;149149+}150150+#define pt_entry_is_write_dirty vtdss_pt_entry_is_write_dirty151151+152152+static inline void vtdss_pt_entry_make_write_clean(struct pt_state *pts)153153+{154154+ u64 *tablep = pt_cur_table(pts, u64) + pts->index;155155+156156+ WRITE_ONCE(*tablep, READ_ONCE(*tablep) & ~(u64)VTDSS_FMT_D);157157+}158158+#define pt_entry_make_write_clean vtdss_pt_entry_make_write_clean159159+160160+static inline bool vtdss_pt_entry_make_write_dirty(struct pt_state *pts)161161+{162162+ u64 *tablep = pt_cur_table(pts, u64) + pts->index;163163+ u64 new = pts->entry | VTDSS_FMT_D;164164+165165+ return try_cmpxchg64(tablep, &pts->entry, new);166166+}167167+#define pt_entry_make_write_dirty vtdss_pt_entry_make_write_dirty168168+169169+static inline unsigned int vtdss_pt_max_sw_bit(struct pt_common *common)170170+{171171+ return 10;172172+}173173+#define pt_max_sw_bit vtdss_pt_max_sw_bit174174+175175+static inline u64 vtdss_pt_sw_bit(unsigned int bitnr)176176+{177177+ if (__builtin_constant_p(bitnr) && bitnr > 10)178178+ BUILD_BUG();179179+180180+ /* Bits marked Ignored in the specification */181181+ switch (bitnr) {182182+ case 0:183183+ return BIT(10);184184+ case 1 ... 
9:185185+ return BIT_ULL((bitnr - 1) + 52);186186+ case 10:187187+ return BIT_ULL(63);188188+ /* Some bits in 9-3 are available in some entries */189189+ default:190190+ PT_WARN_ON(true);191191+ return 0;192192+ }193193+}194194+#define pt_sw_bit vtdss_pt_sw_bit195195+196196+/* --- iommu */197197+#include <linux/generic_pt/iommu.h>198198+#include <linux/iommu.h>199199+200200+#define pt_iommu_table pt_iommu_vtdss201201+202202+/* The common struct is in the per-format common struct */203203+static inline struct pt_common *common_from_iommu(struct pt_iommu *iommu_table)204204+{205205+ return &container_of(iommu_table, struct pt_iommu_table, iommu)206206+ ->vtdss_pt.common;207207+}208208+209209+static inline struct pt_iommu *iommu_from_common(struct pt_common *common)210210+{211211+ return &container_of(common, struct pt_iommu_table, vtdss_pt.common)212212+ ->iommu;213213+}214214+215215+static inline int vtdss_pt_iommu_set_prot(struct pt_common *common,216216+ struct pt_write_attrs *attrs,217217+ unsigned int iommu_prot)218218+{219219+ u64 pte = 0;220220+221221+ /*222222+ * VTDSS does not have a present bit, so we tell if any entry is present223223+ * by checking for R or W.224224+ */225225+ if (!(iommu_prot & (IOMMU_READ | IOMMU_WRITE)))226226+ return -EINVAL;227227+228228+ if (iommu_prot & IOMMU_READ)229229+ pte |= VTDSS_FMT_R;230230+ if (iommu_prot & IOMMU_WRITE)231231+ pte |= VTDSS_FMT_W;232232+ if (pt_feature(common, PT_FEAT_VTDSS_FORCE_COHERENCE))233233+ pte |= VTDSS_FMT_SNP;234234+235235+ if (pt_feature(common, PT_FEAT_VTDSS_FORCE_WRITEABLE) &&236236+ !(iommu_prot & IOMMU_WRITE)) {237237+ pr_err_ratelimited(238238+ "Read-only mapping is disallowed on the domain which serves as the parent in a nested configuration, due to HW errata (ERRATA_772415_SPR17)\n");239239+ return -EINVAL;240240+ }241241+242242+ attrs->descriptor_bits = pte;243243+ return 0;244244+}245245+#define pt_iommu_set_prot vtdss_pt_iommu_set_prot246246+247247+static inline int vtdss_pt_iommu_fmt_init(struct pt_iommu_vtdss *iommu_table,248248+ const struct pt_iommu_vtdss_cfg *cfg)249249+{250250+ struct pt_vtdss *table = &iommu_table->vtdss_pt;251251+252252+ if (cfg->top_level > 4 || cfg->top_level < 2)253253+ return -EOPNOTSUPP;254254+255255+ pt_top_set_level(&table->common, cfg->top_level);256256+ return 0;257257+}258258+#define pt_iommu_fmt_init vtdss_pt_iommu_fmt_init259259+260260+static inline void261261+vtdss_pt_iommu_fmt_hw_info(struct pt_iommu_vtdss *table,262262+ const struct pt_range *top_range,263263+ struct pt_iommu_vtdss_hw_info *info)264264+{265265+ info->ssptptr = virt_to_phys(top_range->top_table);266266+ PT_WARN_ON(info->ssptptr & ~PT_TOP_PHYS_MASK);267267+ /*268268+ * top_level = 2 = 3 level table aw=1269269+ * top_level = 3 = 4 level table aw=2270270+ * top_level = 4 = 5 level table aw=3271271+ */272272+ info->aw = top_range->top_level - 1;273273+}274274+#define pt_iommu_fmt_hw_info vtdss_pt_iommu_fmt_hw_info275275+276276+#if defined(GENERIC_PT_KUNIT)277277+static const struct pt_iommu_vtdss_cfg vtdss_kunit_fmt_cfgs[] = {278278+ [0] = { .common.hw_max_vasz_lg2 = 39, .top_level = 2},279279+ [1] = { .common.hw_max_vasz_lg2 = 48, .top_level = 3},280280+ [2] = { .common.hw_max_vasz_lg2 = 57, .top_level = 4},281281+};282282+#define kunit_fmt_cfgs vtdss_kunit_fmt_cfgs283283+enum { KUNIT_FMT_FEATURES = BIT(PT_FEAT_VTDSS_FORCE_WRITEABLE) };284284+#endif285285+#endif
+279
drivers/iommu/generic_pt/fmt/x86_64.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * x86 page table. Supports the 4 and 5 level variations.66+ *77+ * The 4 and 5 level version is described in:88+ * Section "4.4 4-Level Paging and 5-Level Paging" of the Intel Software99+ * Developer's Manual Volume 31010+ *1111+ * Section "9.7 First-Stage Paging Entries" of the "Intel Virtualization1212+ * Technology for Directed I/O Architecture Specification"1313+ *1414+ * Section "2.2.6 I/O Page Tables for Guest Translations" of the "AMD I/O1515+ * Virtualization Technology (IOMMU) Specification"1616+ *1717+ * It is used by x86 CPUs, AMD and VT-d IOMMU HW.1818+ *1919+ * Note the 3 level format is very similar and almost implemented here. The2020+ * reserved/ignored layout is different and there are functional bit2121+ * differences.2222+ *2323+ * This format uses PT_FEAT_SIGN_EXTEND to have a upper/non-canonical/lower2424+ * split. PT_FEAT_SIGN_EXTEND is optional as AMD IOMMU sometimes uses non-sign2525+ * extended addressing with this page table format.2626+ *2727+ * The named levels in the spec map to the pts->level as:2828+ * Table/PTE - 02929+ * Directory/PDE - 13030+ * Directory Ptr/PDPTE - 23131+ * PML4/PML4E - 33232+ * PML5/PML5E - 43333+ */3434+#ifndef __GENERIC_PT_FMT_X86_64_H3535+#define __GENERIC_PT_FMT_X86_64_H3636+3737+#include "defs_x86_64.h"3838+#include "../pt_defs.h"3939+4040+#include <linux/bitfield.h>4141+#include <linux/container_of.h>4242+#include <linux/log2.h>4343+#include <linux/mem_encrypt.h>4444+4545+enum {4646+ PT_MAX_OUTPUT_ADDRESS_LG2 = 52,4747+ PT_MAX_VA_ADDRESS_LG2 = 57,4848+ PT_ITEM_WORD_SIZE = sizeof(u64),4949+ PT_MAX_TOP_LEVEL = 4,5050+ PT_GRANULE_LG2SZ = 12,5151+ PT_TABLEMEM_LG2SZ = 12,5252+5353+ /*5454+ * For AMD the GCR3 Base only has these bits. 
For VT-d FSPTPTR is 4k5555+ * aligned and is limited by the architected HAW5656+ */5757+ PT_TOP_PHYS_MASK = GENMASK_ULL(51, 12),5858+};5959+6060+/* Shared descriptor bits */6161+enum {6262+ X86_64_FMT_P = BIT(0),6363+ X86_64_FMT_RW = BIT(1),6464+ X86_64_FMT_U = BIT(2),6565+ X86_64_FMT_A = BIT(5),6666+ X86_64_FMT_D = BIT(6),6767+ X86_64_FMT_OA = GENMASK_ULL(51, 12),6868+ X86_64_FMT_XD = BIT_ULL(63),6969+};7070+7171+/* PDPTE/PDE */7272+enum {7373+ X86_64_FMT_PS = BIT(7),7474+};7575+7676+static inline pt_oaddr_t x86_64_pt_table_pa(const struct pt_state *pts)7777+{7878+ u64 entry = pts->entry;7979+8080+ if (pts_feature(pts, PT_FEAT_X86_64_AMD_ENCRYPT_TABLES))8181+ entry = __sme_clr(entry);8282+ return oalog2_mul(FIELD_GET(X86_64_FMT_OA, entry),8383+ PT_TABLEMEM_LG2SZ);8484+}8585+#define pt_table_pa x86_64_pt_table_pa8686+8787+static inline pt_oaddr_t x86_64_pt_entry_oa(const struct pt_state *pts)8888+{8989+ u64 entry = pts->entry;9090+9191+ if (pts_feature(pts, PT_FEAT_X86_64_AMD_ENCRYPT_TABLES))9292+ entry = __sme_clr(entry);9393+ return oalog2_mul(FIELD_GET(X86_64_FMT_OA, entry),9494+ PT_GRANULE_LG2SZ);9595+}9696+#define pt_entry_oa x86_64_pt_entry_oa9797+9898+static inline bool x86_64_pt_can_have_leaf(const struct pt_state *pts)9999+{100100+ return pts->level <= 2;101101+}102102+#define pt_can_have_leaf x86_64_pt_can_have_leaf103103+104104+static inline unsigned int x86_64_pt_num_items_lg2(const struct pt_state *pts)105105+{106106+ return PT_TABLEMEM_LG2SZ - ilog2(sizeof(u64));107107+}108108+#define pt_num_items_lg2 x86_64_pt_num_items_lg2109109+110110+static inline enum pt_entry_type x86_64_pt_load_entry_raw(struct pt_state *pts)111111+{112112+ const u64 *tablep = pt_cur_table(pts, u64);113113+ u64 entry;114114+115115+ pts->entry = entry = READ_ONCE(tablep[pts->index]);116116+ if (!(entry & X86_64_FMT_P))117117+ return PT_ENTRY_EMPTY;118118+ if (pts->level == 0 ||119119+ (x86_64_pt_can_have_leaf(pts) && (entry & X86_64_FMT_PS)))120120+ return PT_ENTRY_OA;121121+ return PT_ENTRY_TABLE;122122+}123123+#define pt_load_entry_raw x86_64_pt_load_entry_raw124124+125125+static inline void126126+x86_64_pt_install_leaf_entry(struct pt_state *pts, pt_oaddr_t oa,127127+ unsigned int oasz_lg2,128128+ const struct pt_write_attrs *attrs)129129+{130130+ u64 *tablep = pt_cur_table(pts, u64);131131+ u64 entry;132132+133133+ if (!pt_check_install_leaf_args(pts, oa, oasz_lg2))134134+ return;135135+136136+ entry = X86_64_FMT_P |137137+ FIELD_PREP(X86_64_FMT_OA, log2_div(oa, PT_GRANULE_LG2SZ)) |138138+ attrs->descriptor_bits;139139+ if (pts->level != 0)140140+ entry |= X86_64_FMT_PS;141141+142142+ WRITE_ONCE(tablep[pts->index], entry);143143+ pts->entry = entry;144144+}145145+#define pt_install_leaf_entry x86_64_pt_install_leaf_entry146146+147147+static inline bool x86_64_pt_install_table(struct pt_state *pts,148148+ pt_oaddr_t table_pa,149149+ const struct pt_write_attrs *attrs)150150+{151151+ u64 entry;152152+153153+ entry = X86_64_FMT_P | X86_64_FMT_RW | X86_64_FMT_U | X86_64_FMT_A |154154+ FIELD_PREP(X86_64_FMT_OA, log2_div(table_pa, PT_GRANULE_LG2SZ));155155+ if (pts_feature(pts, PT_FEAT_X86_64_AMD_ENCRYPT_TABLES))156156+ entry = __sme_set(entry);157157+ return pt_table_install64(pts, entry);158158+}159159+#define pt_install_table x86_64_pt_install_table160160+161161+static inline void x86_64_pt_attr_from_entry(const struct pt_state *pts,162162+ struct pt_write_attrs *attrs)163163+{164164+ attrs->descriptor_bits = pts->entry &165165+ (X86_64_FMT_RW | X86_64_FMT_U | X86_64_FMT_A |166166+ X86_64_FMT_D | 
X86_64_FMT_XD);167167+}168168+#define pt_attr_from_entry x86_64_pt_attr_from_entry169169+170170+static inline unsigned int x86_64_pt_max_sw_bit(struct pt_common *common)171171+{172172+ return 12;173173+}174174+#define pt_max_sw_bit x86_64_pt_max_sw_bit175175+176176+static inline u64 x86_64_pt_sw_bit(unsigned int bitnr)177177+{178178+ if (__builtin_constant_p(bitnr) && bitnr > 12)179179+ BUILD_BUG();180180+181181+ /* Bits marked Ignored/AVL in the specification */182182+ switch (bitnr) {183183+ case 0:184184+ return BIT(9);185185+ case 1:186186+ return BIT(11);187187+ case 2 ... 12:188188+ return BIT_ULL((bitnr - 2) + 52);189189+ /* Some bits in 8,6,4,3 are available in some entries */190190+ default:191191+ PT_WARN_ON(true);192192+ return 0;193193+ }194194+}195195+#define pt_sw_bit x86_64_pt_sw_bit196196+197197+/* --- iommu */198198+#include <linux/generic_pt/iommu.h>199199+#include <linux/iommu.h>200200+201201+#define pt_iommu_table pt_iommu_x86_64202202+203203+/* The common struct is in the per-format common struct */204204+static inline struct pt_common *common_from_iommu(struct pt_iommu *iommu_table)205205+{206206+ return &container_of(iommu_table, struct pt_iommu_table, iommu)207207+ ->x86_64_pt.common;208208+}209209+210210+static inline struct pt_iommu *iommu_from_common(struct pt_common *common)211211+{212212+ return &container_of(common, struct pt_iommu_table, x86_64_pt.common)213213+ ->iommu;214214+}215215+216216+static inline int x86_64_pt_iommu_set_prot(struct pt_common *common,217217+ struct pt_write_attrs *attrs,218218+ unsigned int iommu_prot)219219+{220220+ u64 pte;221221+222222+ pte = X86_64_FMT_U | X86_64_FMT_A;223223+ if (iommu_prot & IOMMU_WRITE)224224+ pte |= X86_64_FMT_RW | X86_64_FMT_D;225225+226226+ /*227227+ * Ideally we'd have an IOMMU_ENCRYPTED flag set by higher levels to228228+ * control this. 
For now if the tables use sme_set then so do the ptes.229229+ */230230+ if (pt_feature(common, PT_FEAT_X86_64_AMD_ENCRYPT_TABLES))231231+ pte = __sme_set(pte);232232+233233+ attrs->descriptor_bits = pte;234234+ return 0;235235+}236236+#define pt_iommu_set_prot x86_64_pt_iommu_set_prot237237+238238+static inline int239239+x86_64_pt_iommu_fmt_init(struct pt_iommu_x86_64 *iommu_table,240240+ const struct pt_iommu_x86_64_cfg *cfg)241241+{242242+ struct pt_x86_64 *table = &iommu_table->x86_64_pt;243243+244244+ if (cfg->top_level < 3 || cfg->top_level > 4)245245+ return -EOPNOTSUPP;246246+247247+ pt_top_set_level(&table->common, cfg->top_level);248248+249249+ table->common.max_oasz_lg2 =250250+ min(PT_MAX_OUTPUT_ADDRESS_LG2, cfg->common.hw_max_oasz_lg2);251251+ return 0;252252+}253253+#define pt_iommu_fmt_init x86_64_pt_iommu_fmt_init254254+255255+static inline void256256+x86_64_pt_iommu_fmt_hw_info(struct pt_iommu_x86_64 *table,257257+ const struct pt_range *top_range,258258+ struct pt_iommu_x86_64_hw_info *info)259259+{260260+ info->gcr3_pt = virt_to_phys(top_range->top_table);261261+ PT_WARN_ON(info->gcr3_pt & ~PT_TOP_PHYS_MASK);262262+ info->levels = top_range->top_level + 1;263263+}264264+#define pt_iommu_fmt_hw_info x86_64_pt_iommu_fmt_hw_info265265+266266+#if defined(GENERIC_PT_KUNIT)267267+static const struct pt_iommu_x86_64_cfg x86_64_kunit_fmt_cfgs[] = {268268+ [0] = { .common.features = BIT(PT_FEAT_SIGN_EXTEND),269269+ .common.hw_max_vasz_lg2 = 48, .top_level = 3 },270270+ [1] = { .common.features = BIT(PT_FEAT_SIGN_EXTEND),271271+ .common.hw_max_vasz_lg2 = 57, .top_level = 4 },272272+ /* AMD IOMMU PASID 0 formats with no SIGN_EXTEND */273273+ [2] = { .common.hw_max_vasz_lg2 = 47, .top_level = 3 },274274+ [3] = { .common.hw_max_vasz_lg2 = 56, .top_level = 4},275275+};276276+#define kunit_fmt_cfgs x86_64_kunit_fmt_cfgs277277+enum { KUNIT_FMT_FEATURES = BIT(PT_FEAT_SIGN_EXTEND)};278278+#endif279279+#endif
+1289
drivers/iommu/generic_pt/iommu_pt.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * "Templated C code" for implementing the iommu operations for page tables.66+ * This is compiled multiple times, over all the page table formats to pick up77+ * the per-format definitions.88+ */99+#ifndef __GENERIC_PT_IOMMU_PT_H1010+#define __GENERIC_PT_IOMMU_PT_H1111+1212+#include "pt_iter.h"1313+1414+#include <linux/export.h>1515+#include <linux/iommu.h>1616+#include "../iommu-pages.h"1717+#include <linux/cleanup.h>1818+#include <linux/dma-mapping.h>1919+2020+enum {2121+ SW_BIT_CACHE_FLUSH_DONE = 0,2222+};2323+2424+static void flush_writes_range(const struct pt_state *pts,2525+ unsigned int start_index, unsigned int end_index)2626+{2727+ if (pts_feature(pts, PT_FEAT_DMA_INCOHERENT))2828+ iommu_pages_flush_incoherent(2929+ iommu_from_common(pts->range->common)->iommu_device,3030+ pts->table, start_index * PT_ITEM_WORD_SIZE,3131+ (end_index - start_index) * PT_ITEM_WORD_SIZE);3232+}3333+3434+static void flush_writes_item(const struct pt_state *pts)3535+{3636+ if (pts_feature(pts, PT_FEAT_DMA_INCOHERENT))3737+ iommu_pages_flush_incoherent(3838+ iommu_from_common(pts->range->common)->iommu_device,3939+ pts->table, pts->index * PT_ITEM_WORD_SIZE,4040+ PT_ITEM_WORD_SIZE);4141+}4242+4343+static void gather_range_pages(struct iommu_iotlb_gather *iotlb_gather,4444+ struct pt_iommu *iommu_table, pt_vaddr_t iova,4545+ pt_vaddr_t len,4646+ struct iommu_pages_list *free_list)4747+{4848+ struct pt_common *common = common_from_iommu(iommu_table);4949+5050+ if (pt_feature(common, PT_FEAT_DMA_INCOHERENT))5151+ iommu_pages_stop_incoherent_list(free_list,5252+ iommu_table->iommu_device);5353+5454+ if (pt_feature(common, PT_FEAT_FLUSH_RANGE_NO_GAPS) &&5555+ iommu_iotlb_gather_is_disjoint(iotlb_gather, iova, len)) {5656+ iommu_iotlb_sync(&iommu_table->domain, iotlb_gather);5757+ /*5858+ * Note that the sync frees the gather's free list, so we must5959+ * not have any pages on that list that are covered by iova/len6060+ */6161+ } else if (pt_feature(common, PT_FEAT_FLUSH_RANGE)) {6262+ iommu_iotlb_gather_add_range(iotlb_gather, iova, len);6363+ }6464+6565+ iommu_pages_list_splice(free_list, &iotlb_gather->freelist);6666+}6767+6868+#define DOMAIN_NS(op) CONCATENATE(CONCATENATE(pt_iommu_, PTPFX), op)6969+7070+static int make_range_ul(struct pt_common *common, struct pt_range *range,7171+ unsigned long iova, unsigned long len)7272+{7373+ unsigned long last;7474+7575+ if (unlikely(len == 0))7676+ return -EINVAL;7777+7878+ if (check_add_overflow(iova, len - 1, &last))7979+ return -EOVERFLOW;8080+8181+ *range = pt_make_range(common, iova, last);8282+ if (sizeof(iova) > sizeof(range->va)) {8383+ if (unlikely(range->va != iova || range->last_va != last))8484+ return -EOVERFLOW;8585+ }8686+ return 0;8787+}8888+8989+static __maybe_unused int make_range_u64(struct pt_common *common,9090+ struct pt_range *range, u64 iova,9191+ u64 len)9292+{9393+ if (unlikely(iova > ULONG_MAX || len > ULONG_MAX))9494+ return -EOVERFLOW;9595+ return make_range_ul(common, range, iova, len);9696+}9797+9898+/*9999+ * Some APIs use unsigned long, while othersuse dma_addr_t as the type. 
Dispatch100100+ * to the correct validation based on the type.101101+ */102102+#define make_range_no_check(common, range, iova, len) \103103+ ({ \104104+ int ret; \105105+ if (sizeof(iova) > sizeof(unsigned long) || \106106+ sizeof(len) > sizeof(unsigned long)) \107107+ ret = make_range_u64(common, range, iova, len); \108108+ else \109109+ ret = make_range_ul(common, range, iova, len); \110110+ ret; \111111+ })112112+113113+#define make_range(common, range, iova, len) \114114+ ({ \115115+ int ret = make_range_no_check(common, range, iova, len); \116116+ if (!ret) \117117+ ret = pt_check_range(range); \118118+ ret; \119119+ })120120+121121+static inline unsigned int compute_best_pgsize(struct pt_state *pts,122122+ pt_oaddr_t oa)123123+{124124+ struct pt_iommu *iommu_table = iommu_from_common(pts->range->common);125125+126126+ if (!pt_can_have_leaf(pts))127127+ return 0;128128+129129+ /*130130+ * The page size is limited by the domain's bitmap. This allows the core131131+ * code to reduce the supported page sizes by changing the bitmap.132132+ */133133+ return pt_compute_best_pgsize(pt_possible_sizes(pts) &134134+ iommu_table->domain.pgsize_bitmap,135135+ pts->range->va, pts->range->last_va, oa);136136+}137137+138138+static __always_inline int __do_iova_to_phys(struct pt_range *range, void *arg,139139+ unsigned int level,140140+ struct pt_table_p *table,141141+ pt_level_fn_t descend_fn)142142+{143143+ struct pt_state pts = pt_init(range, level, table);144144+ pt_oaddr_t *res = arg;145145+146146+ switch (pt_load_single_entry(&pts)) {147147+ case PT_ENTRY_EMPTY:148148+ return -ENOENT;149149+ case PT_ENTRY_TABLE:150150+ return pt_descend(&pts, arg, descend_fn);151151+ case PT_ENTRY_OA:152152+ *res = pt_entry_oa_exact(&pts);153153+ return 0;154154+ }155155+ return -ENOENT;156156+}157157+PT_MAKE_LEVELS(__iova_to_phys, __do_iova_to_phys);158158+159159+/**160160+ * iova_to_phys() - Return the output address for the given IOVA161161+ * @domain: Table to query162162+ * @iova: IO virtual address to query163163+ *164164+ * Determine the output address from the given IOVA. 
@iova may have any165165+ * alignment, the returned physical will be adjusted with any sub page offset.166166+ *167167+ * Context: The caller must hold a read range lock that includes @iova.168168+ *169169+ * Return: 0 if there is no translation for the given iova.170170+ */171171+phys_addr_t DOMAIN_NS(iova_to_phys)(struct iommu_domain *domain,172172+ dma_addr_t iova)173173+{174174+ struct pt_iommu *iommu_table =175175+ container_of(domain, struct pt_iommu, domain);176176+ struct pt_range range;177177+ pt_oaddr_t res;178178+ int ret;179179+180180+ ret = make_range(common_from_iommu(iommu_table), &range, iova, 1);181181+ if (ret)182182+ return ret;183183+184184+ ret = pt_walk_range(&range, __iova_to_phys, &res);185185+ /* PHYS_ADDR_MAX would be a better error code */186186+ if (ret)187187+ return 0;188188+ return res;189189+}190190+EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(iova_to_phys), "GENERIC_PT_IOMMU");191191+192192+struct pt_iommu_dirty_args {193193+ struct iommu_dirty_bitmap *dirty;194194+ unsigned int flags;195195+};196196+197197+static void record_dirty(struct pt_state *pts,198198+ struct pt_iommu_dirty_args *dirty,199199+ unsigned int num_contig_lg2)200200+{201201+ pt_vaddr_t dirty_len;202202+203203+ if (num_contig_lg2 != ilog2(1)) {204204+ unsigned int index = pts->index;205205+ unsigned int end_index = log2_set_mod_max_t(206206+ unsigned int, pts->index, num_contig_lg2);207207+208208+ /* Adjust for being contained inside a contiguous page */209209+ end_index = min(end_index, pts->end_index);210210+ dirty_len = (end_index - index) *211211+ log2_to_int(pt_table_item_lg2sz(pts));212212+ } else {213213+ dirty_len = log2_to_int(pt_table_item_lg2sz(pts));214214+ }215215+216216+ if (dirty->dirty->bitmap)217217+ iova_bitmap_set(dirty->dirty->bitmap, pts->range->va,218218+ dirty_len);219219+220220+ if (!(dirty->flags & IOMMU_DIRTY_NO_CLEAR)) {221221+ /*222222+ * No write log required because DMA incoherence and atomic223223+ * dirty tracking bits can't work together224224+ */225225+ pt_entry_make_write_clean(pts);226226+ iommu_iotlb_gather_add_range(dirty->dirty->gather,227227+ pts->range->va, dirty_len);228228+ }229229+}230230+231231+static inline int __read_and_clear_dirty(struct pt_range *range, void *arg,232232+ unsigned int level,233233+ struct pt_table_p *table)234234+{235235+ struct pt_state pts = pt_init(range, level, table);236236+ struct pt_iommu_dirty_args *dirty = arg;237237+ int ret;238238+239239+ for_each_pt_level_entry(&pts) {240240+ if (pts.type == PT_ENTRY_TABLE) {241241+ ret = pt_descend(&pts, arg, __read_and_clear_dirty);242242+ if (ret)243243+ return ret;244244+ continue;245245+ }246246+ if (pts.type == PT_ENTRY_OA && pt_entry_is_write_dirty(&pts))247247+ record_dirty(&pts, dirty,248248+ pt_entry_num_contig_lg2(&pts));249249+ }250250+ return 0;251251+}252252+253253+/**254254+ * read_and_clear_dirty() - Manipulate the HW set write dirty state255255+ * @domain: Domain to manipulate256256+ * @iova: IO virtual address to start257257+ * @size: Length of the IOVA258258+ * @flags: A bitmap of IOMMU_DIRTY_NO_CLEAR259259+ * @dirty: Place to store the dirty bits260260+ *261261+ * Iterate over all the entries in the mapped range and record their write dirty262262+ * status in iommu_dirty_bitmap. 
If IOMMU_DIRTY_NO_CLEAR is specified then
 * the entries will be left dirty, otherwise they are returned to being not
 * write dirty.
 *
 * Context: The caller must hold a read range lock that includes @iova.
 *
 * Returns: -ERRNO on failure, 0 on success.
 */
int DOMAIN_NS(read_and_clear_dirty)(struct iommu_domain *domain,
                                    unsigned long iova, size_t size,
                                    unsigned long flags,
                                    struct iommu_dirty_bitmap *dirty)
{
        struct pt_iommu *iommu_table =
                container_of(domain, struct pt_iommu, domain);
        struct pt_iommu_dirty_args dirty_args = {
                .dirty = dirty,
                .flags = flags,
        };
        struct pt_range range;
        int ret;

#if !IS_ENABLED(CONFIG_IOMMUFD_DRIVER) || !defined(pt_entry_is_write_dirty)
        return -EOPNOTSUPP;
#endif

        ret = make_range(common_from_iommu(iommu_table), &range, iova, size);
        if (ret)
                return ret;

        ret = pt_walk_range(&range, __read_and_clear_dirty, &dirty_args);
        PT_WARN_ON(ret);
        return ret;
}
EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(read_and_clear_dirty), "GENERIC_PT_IOMMU");

static inline int __set_dirty(struct pt_range *range, void *arg,
                              unsigned int level, struct pt_table_p *table)
{
        struct pt_state pts = pt_init(range, level, table);

        switch (pt_load_single_entry(&pts)) {
        case PT_ENTRY_EMPTY:
                return -ENOENT;
        case PT_ENTRY_TABLE:
                return pt_descend(&pts, arg, __set_dirty);
        case PT_ENTRY_OA:
                if (!pt_entry_make_write_dirty(&pts))
                        return -EAGAIN;
                return 0;
        }
        return -ENOENT;
}

static int __maybe_unused NS(set_dirty)(struct pt_iommu *iommu_table,
                                        dma_addr_t iova)
{
        struct pt_range range;
        int ret;

        ret = make_range(common_from_iommu(iommu_table), &range, iova, 1);
        if (ret)
                return ret;

        /*
         * Note: There is no locking here yet, if the test suite races this it
         * can crash. It should use RCU locking eventually.
         */
        return pt_walk_range(&range, __set_dirty, NULL);
}

struct pt_iommu_collect_args {
        struct iommu_pages_list free_list;
        /* Fail if any OAs are within the range */
        u8 check_mapped : 1;
};

static int __collect_tables(struct pt_range *range, void *arg,
                            unsigned int level, struct pt_table_p *table)
{
        struct pt_state pts = pt_init(range, level, table);
        struct pt_iommu_collect_args *collect = arg;
        int ret;

        if (!collect->check_mapped && !pt_can_have_table(&pts))
                return 0;

        for_each_pt_level_entry(&pts) {
                if (pts.type == PT_ENTRY_TABLE) {
                        iommu_pages_list_add(&collect->free_list, pts.table_lower);
                        ret = pt_descend(&pts, arg, __collect_tables);
                        if (ret)
                                return ret;
                        continue;
                }
                if (pts.type == PT_ENTRY_OA && collect->check_mapped)
                        return -EADDRINUSE;
        }
        return 0;
}

enum alloc_mode {ALLOC_NORMAL, ALLOC_DEFER_COHERENT_FLUSH};

/* Allocate a table, the empty table will be ready to be installed. 
*/366366+static inline struct pt_table_p *_table_alloc(struct pt_common *common,367367+ size_t lg2sz, gfp_t gfp,368368+ enum alloc_mode mode)369369+{370370+ struct pt_iommu *iommu_table = iommu_from_common(common);371371+ struct pt_table_p *table_mem;372372+373373+ table_mem = iommu_alloc_pages_node_sz(iommu_table->nid, gfp,374374+ log2_to_int(lg2sz));375375+ if (pt_feature(common, PT_FEAT_DMA_INCOHERENT) &&376376+ mode == ALLOC_NORMAL) {377377+ int ret = iommu_pages_start_incoherent(378378+ table_mem, iommu_table->iommu_device);379379+ if (ret) {380380+ iommu_free_pages(table_mem);381381+ return ERR_PTR(ret);382382+ }383383+ }384384+ return table_mem;385385+}386386+387387+static inline struct pt_table_p *table_alloc_top(struct pt_common *common,388388+ uintptr_t top_of_table,389389+ gfp_t gfp,390390+ enum alloc_mode mode)391391+{392392+ /*393393+ * Top doesn't need the free list or otherwise, so it technically394394+ * doesn't need to use iommu pages. Use the API anyhow as the top is395395+ * usually not smaller than PAGE_SIZE to keep things simple.396396+ */397397+ return _table_alloc(common, pt_top_memsize_lg2(common, top_of_table),398398+ gfp, mode);399399+}400400+401401+/* Allocate an interior table */402402+static inline struct pt_table_p *table_alloc(const struct pt_state *parent_pts,403403+ gfp_t gfp, enum alloc_mode mode)404404+{405405+ struct pt_state child_pts =406406+ pt_init(parent_pts->range, parent_pts->level - 1, NULL);407407+408408+ return _table_alloc(parent_pts->range->common,409409+ pt_num_items_lg2(&child_pts) +410410+ ilog2(PT_ITEM_WORD_SIZE),411411+ gfp, mode);412412+}413413+414414+static inline int pt_iommu_new_table(struct pt_state *pts,415415+ struct pt_write_attrs *attrs)416416+{417417+ struct pt_table_p *table_mem;418418+ phys_addr_t phys;419419+420420+ /* Given PA/VA/length can't be represented */421421+ if (PT_WARN_ON(!pt_can_have_table(pts)))422422+ return -ENXIO;423423+424424+ table_mem = table_alloc(pts, attrs->gfp, ALLOC_NORMAL);425425+ if (IS_ERR(table_mem))426426+ return PTR_ERR(table_mem);427427+428428+ phys = virt_to_phys(table_mem);429429+ if (!pt_install_table(pts, phys, attrs)) {430430+ iommu_pages_free_incoherent(431431+ table_mem,432432+ iommu_from_common(pts->range->common)->iommu_device);433433+ return -EAGAIN;434434+ }435435+436436+ if (pts_feature(pts, PT_FEAT_DMA_INCOHERENT)) {437437+ flush_writes_item(pts);438438+ pt_set_sw_bit_release(pts, SW_BIT_CACHE_FLUSH_DONE);439439+ }440440+441441+ if (IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)) {442442+ /*443443+ * The underlying table can't store the physical table address.444444+ * This happens when kunit testing tables outside their normal445445+ * environment where a CPU might be limited.446446+ */447447+ pt_load_single_entry(pts);448448+ if (PT_WARN_ON(pt_table_pa(pts) != phys)) {449449+ pt_clear_entries(pts, ilog2(1));450450+ iommu_pages_free_incoherent(451451+ table_mem, iommu_from_common(pts->range->common)452452+ ->iommu_device);453453+ return -EINVAL;454454+ }455455+ }456456+457457+ pts->table_lower = table_mem;458458+ return 0;459459+}460460+461461+struct pt_iommu_map_args {462462+ struct iommu_iotlb_gather *iotlb_gather;463463+ struct pt_write_attrs attrs;464464+ pt_oaddr_t oa;465465+ unsigned int leaf_pgsize_lg2;466466+ unsigned int leaf_level;467467+};468468+469469+/*470470+ * This will recursively check any tables in the block to validate they are471471+ * empty and then free them through the gather.472472+ */473473+static int clear_contig(const struct pt_state *start_pts,474474+ struct 
iommu_iotlb_gather *iotlb_gather,475475+ unsigned int step, unsigned int pgsize_lg2)476476+{477477+ struct pt_iommu *iommu_table =478478+ iommu_from_common(start_pts->range->common);479479+ struct pt_range range = *start_pts->range;480480+ struct pt_state pts =481481+ pt_init(&range, start_pts->level, start_pts->table);482482+ struct pt_iommu_collect_args collect = { .check_mapped = true };483483+ int ret;484484+485485+ pts.index = start_pts->index;486486+ pts.end_index = start_pts->index + step;487487+ for (; _pt_iter_load(&pts); pt_next_entry(&pts)) {488488+ if (pts.type == PT_ENTRY_TABLE) {489489+ collect.free_list =490490+ IOMMU_PAGES_LIST_INIT(collect.free_list);491491+ ret = pt_walk_descend_all(&pts, __collect_tables,492492+ &collect);493493+ if (ret)494494+ return ret;495495+496496+ /*497497+ * The table item must be cleared before we can update498498+ * the gather499499+ */500500+ pt_clear_entries(&pts, ilog2(1));501501+ flush_writes_item(&pts);502502+503503+ iommu_pages_list_add(&collect.free_list,504504+ pt_table_ptr(&pts));505505+ gather_range_pages(506506+ iotlb_gather, iommu_table, range.va,507507+ log2_to_int(pt_table_item_lg2sz(&pts)),508508+ &collect.free_list);509509+ } else if (pts.type != PT_ENTRY_EMPTY) {510510+ return -EADDRINUSE;511511+ }512512+ }513513+ return 0;514514+}515515+516516+static int __map_range_leaf(struct pt_range *range, void *arg,517517+ unsigned int level, struct pt_table_p *table)518518+{519519+ struct pt_state pts = pt_init(range, level, table);520520+ struct pt_iommu_map_args *map = arg;521521+ unsigned int leaf_pgsize_lg2 = map->leaf_pgsize_lg2;522522+ unsigned int start_index;523523+ pt_oaddr_t oa = map->oa;524524+ unsigned int step;525525+ bool need_contig;526526+ int ret = 0;527527+528528+ PT_WARN_ON(map->leaf_level != level);529529+ PT_WARN_ON(!pt_can_have_leaf(&pts));530530+531531+ step = log2_to_int_t(unsigned int,532532+ leaf_pgsize_lg2 - pt_table_item_lg2sz(&pts));533533+ need_contig = leaf_pgsize_lg2 != pt_table_item_lg2sz(&pts);534534+535535+ _pt_iter_first(&pts);536536+ start_index = pts.index;537537+ do {538538+ pts.type = pt_load_entry_raw(&pts);539539+ if (pts.type != PT_ENTRY_EMPTY || need_contig) {540540+ if (pts.index != start_index)541541+ pt_index_to_va(&pts);542542+ ret = clear_contig(&pts, map->iotlb_gather, step,543543+ leaf_pgsize_lg2);544544+ if (ret)545545+ break;546546+ }547547+548548+ if (IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)) {549549+ pt_index_to_va(&pts);550550+ PT_WARN_ON(compute_best_pgsize(&pts, oa) !=551551+ leaf_pgsize_lg2);552552+ }553553+ pt_install_leaf_entry(&pts, oa, leaf_pgsize_lg2, &map->attrs);554554+555555+ oa += log2_to_int(leaf_pgsize_lg2);556556+ pts.index += step;557557+ } while (pts.index < pts.end_index);558558+559559+ flush_writes_range(&pts, start_index, pts.index);560560+561561+ map->oa = oa;562562+ return ret;563563+}564564+565565+static int __map_range(struct pt_range *range, void *arg, unsigned int level,566566+ struct pt_table_p *table)567567+{568568+ struct pt_state pts = pt_init(range, level, table);569569+ struct pt_iommu_map_args *map = arg;570570+ int ret;571571+572572+ PT_WARN_ON(map->leaf_level == level);573573+ PT_WARN_ON(!pt_can_have_table(&pts));574574+575575+ _pt_iter_first(&pts);576576+577577+ /* Descend to a child table */578578+ do {579579+ pts.type = pt_load_entry_raw(&pts);580580+581581+ if (pts.type != PT_ENTRY_TABLE) {582582+ if (pts.type != PT_ENTRY_EMPTY)583583+ return -EADDRINUSE;584584+ ret = pt_iommu_new_table(&pts, &map->attrs);585585+ if (ret) {586586+ /*587587+ * 
Racing with another thread installing a table588588+ */589589+ if (ret == -EAGAIN)590590+ continue;591591+ return ret;592592+ }593593+ } else {594594+ pts.table_lower = pt_table_ptr(&pts);595595+ /*596596+ * Racing with a shared pt_iommu_new_table()? The other597597+ * thread is still flushing the cache, so we have to598598+ * also flush it to ensure that when our thread's map599599+ * completes all the table items leading to our mapping600600+ * are visible.601601+ *602602+ * This requires the pt_set_bit_release() to be a603603+ * release of the cache flush so that this can acquire604604+ * visibility at the iommu.605605+ */606606+ if (pts_feature(&pts, PT_FEAT_DMA_INCOHERENT) &&607607+ !pt_test_sw_bit_acquire(&pts,608608+ SW_BIT_CACHE_FLUSH_DONE))609609+ flush_writes_item(&pts);610610+ }611611+612612+ /*613613+ * The already present table can possibly be shared with another614614+ * concurrent map.615615+ */616616+ if (map->leaf_level == level - 1)617617+ ret = pt_descend(&pts, arg, __map_range_leaf);618618+ else619619+ ret = pt_descend(&pts, arg, __map_range);620620+ if (ret)621621+ return ret;622622+623623+ pts.index++;624624+ pt_index_to_va(&pts);625625+ if (pts.index >= pts.end_index)626626+ break;627627+ } while (true);628628+ return 0;629629+}630630+631631+/*632632+ * Fast path for the easy case of mapping a 4k page to an already allocated633633+ * table. This is a common workload. If it returns EAGAIN run the full algorithm634634+ * instead.635635+ */636636+static __always_inline int __do_map_single_page(struct pt_range *range,637637+ void *arg, unsigned int level,638638+ struct pt_table_p *table,639639+ pt_level_fn_t descend_fn)640640+{641641+ struct pt_state pts = pt_init(range, level, table);642642+ struct pt_iommu_map_args *map = arg;643643+644644+ pts.type = pt_load_single_entry(&pts);645645+ if (level == 0) {646646+ if (pts.type != PT_ENTRY_EMPTY)647647+ return -EADDRINUSE;648648+ pt_install_leaf_entry(&pts, map->oa, PAGE_SHIFT,649649+ &map->attrs);650650+ /* No flush, not used when incoherent */651651+ map->oa += PAGE_SIZE;652652+ return 0;653653+ }654654+ if (pts.type == PT_ENTRY_TABLE)655655+ return pt_descend(&pts, arg, descend_fn);656656+ /* Something else, use the slow path */657657+ return -EAGAIN;658658+}659659+PT_MAKE_LEVELS(__map_single_page, __do_map_single_page);660660+661661+/*662662+ * Add a table to the top, increasing the top level as much as necessary to663663+ * encompass range.664664+ */665665+static int increase_top(struct pt_iommu *iommu_table, struct pt_range *range,666666+ struct pt_iommu_map_args *map)667667+{668668+ struct iommu_pages_list free_list = IOMMU_PAGES_LIST_INIT(free_list);669669+ struct pt_common *common = common_from_iommu(iommu_table);670670+ uintptr_t top_of_table = READ_ONCE(common->top_of_table);671671+ uintptr_t new_top_of_table = top_of_table;672672+ struct pt_table_p *table_mem;673673+ unsigned int new_level;674674+ spinlock_t *domain_lock;675675+ unsigned long flags;676676+ int ret;677677+678678+ while (true) {679679+ struct pt_range top_range =680680+ _pt_top_range(common, new_top_of_table);681681+ struct pt_state pts = pt_init_top(&top_range);682682+683683+ top_range.va = range->va;684684+ top_range.last_va = range->last_va;685685+686686+ if (!pt_check_range(&top_range) &&687687+ map->leaf_level <= pts.level) {688688+ new_level = pts.level;689689+ break;690690+ }691691+692692+ pts.level++;693693+ if (pts.level > PT_MAX_TOP_LEVEL ||694694+ pt_table_item_lg2sz(&pts) >= common->max_vasz_lg2) {695695+ ret = -ERANGE;696696+ goto 
err_free;697697+ }698698+699699+ table_mem =700700+ table_alloc_top(common, _pt_top_set(NULL, pts.level),701701+ map->attrs.gfp, ALLOC_DEFER_COHERENT_FLUSH);702702+ if (IS_ERR(table_mem)) {703703+ ret = PTR_ERR(table_mem);704704+ goto err_free;705705+ }706706+ iommu_pages_list_add(&free_list, table_mem);707707+708708+ /* The new table links to the lower table always at index 0 */709709+ top_range.va = 0;710710+ top_range.top_level = pts.level;711711+ pts.table_lower = pts.table;712712+ pts.table = table_mem;713713+ pt_load_single_entry(&pts);714714+ PT_WARN_ON(pts.index != 0);715715+ pt_install_table(&pts, virt_to_phys(pts.table_lower),716716+ &map->attrs);717717+ new_top_of_table = _pt_top_set(pts.table, pts.level);718718+ }719719+720720+ /*721721+ * Avoid double flushing, flush it once after all pt_install_table()722722+ */723723+ if (pt_feature(common, PT_FEAT_DMA_INCOHERENT)) {724724+ ret = iommu_pages_start_incoherent_list(725725+ &free_list, iommu_table->iommu_device);726726+ if (ret)727727+ goto err_free;728728+ }729729+730730+ /*731731+ * top_of_table is write locked by the spinlock, but readers can use732732+ * READ_ONCE() to get the value. Since we encode both the level and the733733+ * pointer in one quanta the lockless reader will always see something734734+ * valid. The HW must be updated to the new level under the spinlock735735+ * before top_of_table is updated so that concurrent readers don't map736736+ * into the new level until it is fully functional. If another thread737737+ * already updated it while we were working then throw everything away738738+ * and try again.739739+ */740740+ domain_lock = iommu_table->driver_ops->get_top_lock(iommu_table);741741+ spin_lock_irqsave(domain_lock, flags);742742+ if (common->top_of_table != top_of_table ||743743+ top_of_table == new_top_of_table) {744744+ spin_unlock_irqrestore(domain_lock, flags);745745+ ret = -EAGAIN;746746+ goto err_free;747747+ }748748+749749+ /*750750+ * We do not issue any flushes for change_top on the expectation that751751+ * any walk cache will not become a problem by adding another layer to752752+ * the tree. Misses will rewalk from the updated top pointer, hits753753+ * continue to be correct. 
Negative caching is fine too since all the754754+ * new IOVA added by the new top is non-present.755755+ */756756+ iommu_table->driver_ops->change_top(757757+ iommu_table, virt_to_phys(table_mem), new_level);758758+ WRITE_ONCE(common->top_of_table, new_top_of_table);759759+ spin_unlock_irqrestore(domain_lock, flags);760760+ return 0;761761+762762+err_free:763763+ if (pt_feature(common, PT_FEAT_DMA_INCOHERENT))764764+ iommu_pages_stop_incoherent_list(&free_list,765765+ iommu_table->iommu_device);766766+ iommu_put_pages_list(&free_list);767767+ return ret;768768+}769769+770770+static int check_map_range(struct pt_iommu *iommu_table, struct pt_range *range,771771+ struct pt_iommu_map_args *map)772772+{773773+ struct pt_common *common = common_from_iommu(iommu_table);774774+ int ret;775775+776776+ do {777777+ ret = pt_check_range(range);778778+ if (!pt_feature(common, PT_FEAT_DYNAMIC_TOP))779779+ return ret;780780+781781+ if (!ret && map->leaf_level <= range->top_level)782782+ break;783783+784784+ ret = increase_top(iommu_table, range, map);785785+ if (ret && ret != -EAGAIN)786786+ return ret;787787+788788+ /* Reload the new top */789789+ *range = pt_make_range(common, range->va, range->last_va);790790+ } while (ret);791791+ PT_WARN_ON(pt_check_range(range));792792+ return 0;793793+}794794+795795+static int do_map(struct pt_range *range, struct pt_common *common,796796+ bool single_page, struct pt_iommu_map_args *map)797797+{798798+ /*799799+ * The __map_single_page() fast path does not support DMA_INCOHERENT800800+ * flushing to keep its .text small.801801+ */802802+ if (single_page && !pt_feature(common, PT_FEAT_DMA_INCOHERENT)) {803803+ int ret;804804+805805+ ret = pt_walk_range(range, __map_single_page, map);806806+ if (ret != -EAGAIN)807807+ return ret;808808+ /* EAGAIN falls through to the full path */809809+ }810810+811811+ if (map->leaf_level == range->top_level)812812+ return pt_walk_range(range, __map_range_leaf, map);813813+ return pt_walk_range(range, __map_range, map);814814+}815815+816816+/**817817+ * map_pages() - Install translation for an IOVA range818818+ * @domain: Domain to manipulate819819+ * @iova: IO virtual address to start820820+ * @paddr: Physical/Output address to start821821+ * @pgsize: Length of each page822822+ * @pgcount: Length of the range in pgsize units starting from @iova823823+ * @prot: A bitmap of IOMMU_READ/WRITE/CACHE/NOEXEC/MMIO824824+ * @gfp: GFP flags for any memory allocations825825+ * @mapped: Total bytes successfully mapped826826+ *827827+ * The range starting at IOVA will have paddr installed into it. The caller828828+ * must specify a valid pgsize and pgcount to segment the range into compatible829829+ * blocks.830830+ *831831+ * On error the caller will probably want to invoke unmap on the range from iova832832+ * up to the amount indicated by @mapped to return the table back to an833833+ * unchanged state.834834+ *835835+ * Context: The caller must hold a write range lock that includes the whole836836+ * range.837837+ *838838+ * Returns: -ERRNO on failure, 0 on success. 
The number of bytes of VA that were839839+ * mapped are added to @mapped, @mapped is not zerod first.840840+ */841841+int DOMAIN_NS(map_pages)(struct iommu_domain *domain, unsigned long iova,842842+ phys_addr_t paddr, size_t pgsize, size_t pgcount,843843+ int prot, gfp_t gfp, size_t *mapped)844844+{845845+ struct pt_iommu *iommu_table =846846+ container_of(domain, struct pt_iommu, domain);847847+ pt_vaddr_t pgsize_bitmap = iommu_table->domain.pgsize_bitmap;848848+ struct pt_common *common = common_from_iommu(iommu_table);849849+ struct iommu_iotlb_gather iotlb_gather;850850+ pt_vaddr_t len = pgsize * pgcount;851851+ struct pt_iommu_map_args map = {852852+ .iotlb_gather = &iotlb_gather,853853+ .oa = paddr,854854+ .leaf_pgsize_lg2 = vaffs(pgsize),855855+ };856856+ bool single_page = false;857857+ struct pt_range range;858858+ int ret;859859+860860+ iommu_iotlb_gather_init(&iotlb_gather);861861+862862+ if (WARN_ON(!(prot & (IOMMU_READ | IOMMU_WRITE))))863863+ return -EINVAL;864864+865865+ /* Check the paddr doesn't exceed what the table can store */866866+ if ((sizeof(pt_oaddr_t) < sizeof(paddr) &&867867+ (pt_vaddr_t)paddr > PT_VADDR_MAX) ||868868+ (common->max_oasz_lg2 != PT_VADDR_MAX_LG2 &&869869+ oalog2_div(paddr, common->max_oasz_lg2)))870870+ return -ERANGE;871871+872872+ ret = pt_iommu_set_prot(common, &map.attrs, prot);873873+ if (ret)874874+ return ret;875875+ map.attrs.gfp = gfp;876876+877877+ ret = make_range_no_check(common, &range, iova, len);878878+ if (ret)879879+ return ret;880880+881881+ /* Calculate target page size and level for the leaves */882882+ if (pt_has_system_page_size(common) && pgsize == PAGE_SIZE &&883883+ pgcount == 1) {884884+ PT_WARN_ON(!(pgsize_bitmap & PAGE_SIZE));885885+ if (log2_mod(iova | paddr, PAGE_SHIFT))886886+ return -ENXIO;887887+ map.leaf_pgsize_lg2 = PAGE_SHIFT;888888+ map.leaf_level = 0;889889+ single_page = true;890890+ } else {891891+ map.leaf_pgsize_lg2 = pt_compute_best_pgsize(892892+ pgsize_bitmap, range.va, range.last_va, paddr);893893+ if (!map.leaf_pgsize_lg2)894894+ return -ENXIO;895895+ map.leaf_level =896896+ pt_pgsz_lg2_to_level(common, map.leaf_pgsize_lg2);897897+ }898898+899899+ ret = check_map_range(iommu_table, &range, &map);900900+ if (ret)901901+ return ret;902902+903903+ PT_WARN_ON(map.leaf_level > range.top_level);904904+905905+ ret = do_map(&range, common, single_page, &map);906906+907907+ /*908908+ * Table levels were freed and replaced with large items, flush any walk909909+ * cache that may refer to the freed levels.910910+ */911911+ if (!iommu_pages_list_empty(&iotlb_gather.freelist))912912+ iommu_iotlb_sync(&iommu_table->domain, &iotlb_gather);913913+914914+ /* Bytes successfully mapped */915915+ PT_WARN_ON(!ret && map.oa - paddr != len);916916+ *mapped += map.oa - paddr;917917+ return ret;918918+}919919+EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(map_pages), "GENERIC_PT_IOMMU");920920+921921+struct pt_unmap_args {922922+ struct iommu_pages_list free_list;923923+ pt_vaddr_t unmapped;924924+};925925+926926+static __maybe_unused int __unmap_range(struct pt_range *range, void *arg,927927+ unsigned int level,928928+ struct pt_table_p *table)929929+{930930+ struct pt_state pts = pt_init(range, level, table);931931+ struct pt_unmap_args *unmap = arg;932932+ unsigned int num_oas = 0;933933+ unsigned int start_index;934934+ int ret = 0;935935+936936+ _pt_iter_first(&pts);937937+ start_index = pts.index;938938+ pts.type = pt_load_entry_raw(&pts);939939+ /*940940+ * A starting index is in the middle of a contiguous entry941941+ *942942+ * The 
IOMMU API does not require drivers to support unmapping parts of943943+ * large pages. Long ago VFIO would try to split maps but the current944944+ * version never does.945945+ *946946+ * Instead when unmap reaches a partial unmap of the start of a large947947+ * IOPTE it should remove the entire IOPTE and return that size to the948948+ * caller.949949+ */950950+ if (pts.type == PT_ENTRY_OA) {951951+ if (log2_mod(range->va, pt_entry_oa_lg2sz(&pts)))952952+ return -EINVAL;953953+ /* Micro optimization */954954+ goto start_oa;955955+ }956956+957957+ do {958958+ if (pts.type != PT_ENTRY_OA) {959959+ bool fully_covered;960960+961961+ if (pts.type != PT_ENTRY_TABLE) {962962+ ret = -EINVAL;963963+ break;964964+ }965965+966966+ if (pts.index != start_index)967967+ pt_index_to_va(&pts);968968+ pts.table_lower = pt_table_ptr(&pts);969969+970970+ fully_covered = pt_entry_fully_covered(971971+ &pts, pt_table_item_lg2sz(&pts));972972+973973+ ret = pt_descend(&pts, arg, __unmap_range);974974+ if (ret)975975+ break;976976+977977+ /*978978+ * If the unmapping range fully covers the table then we979979+ * can free it as well. The clear is delayed until we980980+ * succeed in clearing the lower table levels.981981+ */982982+ if (fully_covered) {983983+ iommu_pages_list_add(&unmap->free_list,984984+ pts.table_lower);985985+ pt_clear_entries(&pts, ilog2(1));986986+ }987987+ pts.index++;988988+ } else {989989+ unsigned int num_contig_lg2;990990+start_oa:991991+ /*992992+ * If the caller requested an last that falls within a993993+ * single entry then the entire entry is unmapped and994994+ * the length returned will be larger than requested.995995+ */996996+ num_contig_lg2 = pt_entry_num_contig_lg2(&pts);997997+ pt_clear_entries(&pts, num_contig_lg2);998998+ num_oas += log2_to_int(num_contig_lg2);999999+ pts.index += log2_to_int(num_contig_lg2);10001000+ }10011001+ if (pts.index >= pts.end_index)10021002+ break;10031003+ pts.type = pt_load_entry_raw(&pts);10041004+ } while (true);10051005+10061006+ unmap->unmapped += log2_mul(num_oas, pt_table_item_lg2sz(&pts));10071007+ flush_writes_range(&pts, start_index, pts.index);10081008+10091009+ return ret;10101010+}10111011+10121012+/**10131013+ * unmap_pages() - Make a range of IOVA empty/not present10141014+ * @domain: Domain to manipulate10151015+ * @iova: IO virtual address to start10161016+ * @pgsize: Length of each page10171017+ * @pgcount: Length of the range in pgsize units starting from @iova10181018+ * @iotlb_gather: Gather struct that must be flushed on return10191019+ *10201020+ * unmap_pages() will remove a translation created by map_pages(). It cannot10211021+ * subdivide a mapping created by map_pages(), so it should be called with IOVA10221022+ * ranges that match those passed to map_pages(). The IOVA range can aggregate10231023+ * contiguous map_pages() calls so long as no individual range is split.10241024+ *10251025+ * Context: The caller must hold a write range lock that includes10261026+ * the whole range.10271027+ *10281028+ * Returns: Number of bytes of VA unmapped. 
iova + res will be the point10291029+ * unmapping stopped.10301030+ */10311031+size_t DOMAIN_NS(unmap_pages)(struct iommu_domain *domain, unsigned long iova,10321032+ size_t pgsize, size_t pgcount,10331033+ struct iommu_iotlb_gather *iotlb_gather)10341034+{10351035+ struct pt_iommu *iommu_table =10361036+ container_of(domain, struct pt_iommu, domain);10371037+ struct pt_unmap_args unmap = { .free_list = IOMMU_PAGES_LIST_INIT(10381038+ unmap.free_list) };10391039+ pt_vaddr_t len = pgsize * pgcount;10401040+ struct pt_range range;10411041+ int ret;10421042+10431043+ ret = make_range(common_from_iommu(iommu_table), &range, iova, len);10441044+ if (ret)10451045+ return 0;10461046+10471047+ pt_walk_range(&range, __unmap_range, &unmap);10481048+10491049+ gather_range_pages(iotlb_gather, iommu_table, iova, len,10501050+ &unmap.free_list);10511051+10521052+ return unmap.unmapped;10531053+}10541054+EXPORT_SYMBOL_NS_GPL(DOMAIN_NS(unmap_pages), "GENERIC_PT_IOMMU");10551055+10561056+static void NS(get_info)(struct pt_iommu *iommu_table,10571057+ struct pt_iommu_info *info)10581058+{10591059+ struct pt_common *common = common_from_iommu(iommu_table);10601060+ struct pt_range range = pt_top_range(common);10611061+ struct pt_state pts = pt_init_top(&range);10621062+ pt_vaddr_t pgsize_bitmap = 0;10631063+10641064+ if (pt_feature(common, PT_FEAT_DYNAMIC_TOP)) {10651065+ for (pts.level = 0; pts.level <= PT_MAX_TOP_LEVEL;10661066+ pts.level++) {10671067+ if (pt_table_item_lg2sz(&pts) >= common->max_vasz_lg2)10681068+ break;10691069+ pgsize_bitmap |= pt_possible_sizes(&pts);10701070+ }10711071+ } else {10721072+ for (pts.level = 0; pts.level <= range.top_level; pts.level++)10731073+ pgsize_bitmap |= pt_possible_sizes(&pts);10741074+ }10751075+10761076+ /* Hide page sizes larger than the maximum OA */10771077+ info->pgsize_bitmap = oalog2_mod(pgsize_bitmap, common->max_oasz_lg2);10781078+}10791079+10801080+static void NS(deinit)(struct pt_iommu *iommu_table)10811081+{10821082+ struct pt_common *common = common_from_iommu(iommu_table);10831083+ struct pt_range range = pt_all_range(common);10841084+ struct pt_iommu_collect_args collect = {10851085+ .free_list = IOMMU_PAGES_LIST_INIT(collect.free_list),10861086+ };10871087+10881088+ iommu_pages_list_add(&collect.free_list, range.top_table);10891089+ pt_walk_range(&range, __collect_tables, &collect);10901090+10911091+ /*10921092+ * The driver has to already have fenced the HW access to the page table10931093+ * and invalidated any caching referring to this memory.10941094+ */10951095+ if (pt_feature(common, PT_FEAT_DMA_INCOHERENT))10961096+ iommu_pages_stop_incoherent_list(&collect.free_list,10971097+ iommu_table->iommu_device);10981098+ iommu_put_pages_list(&collect.free_list);10991099+}11001100+11011101+static const struct pt_iommu_ops NS(ops) = {11021102+#if IS_ENABLED(CONFIG_IOMMUFD_DRIVER) && defined(pt_entry_is_write_dirty) && \11031103+ IS_ENABLED(CONFIG_IOMMUFD_TEST) && defined(pt_entry_make_write_dirty)11041104+ .set_dirty = NS(set_dirty),11051105+#endif11061106+ .get_info = NS(get_info),11071107+ .deinit = NS(deinit),11081108+};11091109+11101110+static int pt_init_common(struct pt_common *common)11111111+{11121112+ struct pt_range top_range = pt_top_range(common);11131113+11141114+ if (PT_WARN_ON(top_range.top_level > PT_MAX_TOP_LEVEL))11151115+ return -EINVAL;11161116+11171117+ if (top_range.top_level == PT_MAX_TOP_LEVEL ||11181118+ common->max_vasz_lg2 == top_range.max_vasz_lg2)11191119+ common->features &= ~BIT(PT_FEAT_DYNAMIC_TOP);11201120+11211121+ 
if (top_range.max_vasz_lg2 == PT_VADDR_MAX_LG2)11221122+ common->features |= BIT(PT_FEAT_FULL_VA);11231123+11241124+ /* Requested features must match features compiled into this format */11251125+ if ((common->features & ~(unsigned int)PT_SUPPORTED_FEATURES) ||11261126+ (!IS_ENABLED(CONFIG_DEBUG_GENERIC_PT) &&11271127+ (common->features & PT_FORCE_ENABLED_FEATURES) !=11281128+ PT_FORCE_ENABLED_FEATURES))11291129+ return -EOPNOTSUPP;11301130+11311131+ /*11321132+ * Check if the top level of the page table is too small to hold the11331133+ * specified maxvasz.11341134+ */11351135+ if (!pt_feature(common, PT_FEAT_DYNAMIC_TOP) &&11361136+ top_range.top_level != PT_MAX_TOP_LEVEL) {11371137+ struct pt_state pts = { .range = &top_range,11381138+ .level = top_range.top_level };11391139+11401140+ if (common->max_vasz_lg2 >11411141+ pt_num_items_lg2(&pts) + pt_table_item_lg2sz(&pts))11421142+ return -EOPNOTSUPP;11431143+ }11441144+11451145+ if (common->max_oasz_lg2 == 0)11461146+ common->max_oasz_lg2 = pt_max_oa_lg2(common);11471147+ else11481148+ common->max_oasz_lg2 = min(common->max_oasz_lg2,11491149+ pt_max_oa_lg2(common));11501150+ return 0;11511151+}11521152+11531153+static int pt_iommu_init_domain(struct pt_iommu *iommu_table,11541154+ struct iommu_domain *domain)11551155+{11561156+ struct pt_common *common = common_from_iommu(iommu_table);11571157+ struct pt_iommu_info info;11581158+ struct pt_range range;11591159+11601160+ NS(get_info)(iommu_table, &info);11611161+11621162+ domain->type = __IOMMU_DOMAIN_PAGING;11631163+ domain->pgsize_bitmap = info.pgsize_bitmap;11641164+11651165+ if (pt_feature(common, PT_FEAT_DYNAMIC_TOP))11661166+ range = _pt_top_range(common,11671167+ _pt_top_set(NULL, PT_MAX_TOP_LEVEL));11681168+ else11691169+ range = pt_top_range(common);11701170+11711171+ /* A 64-bit high address space table on a 32-bit system cannot work. */11721172+ domain->geometry.aperture_start = (unsigned long)range.va;11731173+ if ((pt_vaddr_t)domain->geometry.aperture_start != range.va)11741174+ return -EOVERFLOW;11751175+11761176+ /*11771177+ * The aperture is limited to what the API can do after considering all11781178+ * the different types dma_addr_t/unsigned long/pt_vaddr_t that are used11791179+ * to store a VA. Set the aperture to something that is valid for all11801180+ * cases. Saturate instead of truncate the end if the types are smaller11811181+ * than the top range. 
aperture_end should be called aperture_last.11821182+ */11831183+ domain->geometry.aperture_end = (unsigned long)range.last_va;11841184+ if ((pt_vaddr_t)domain->geometry.aperture_end != range.last_va) {11851185+ domain->geometry.aperture_end = ULONG_MAX;11861186+ domain->pgsize_bitmap &= ULONG_MAX;11871187+ }11881188+ domain->geometry.force_aperture = true;11891189+11901190+ return 0;11911191+}11921192+11931193+static void pt_iommu_zero(struct pt_iommu_table *fmt_table)11941194+{11951195+ struct pt_iommu *iommu_table = &fmt_table->iommu;11961196+ struct pt_iommu cfg = *iommu_table;11971197+11981198+ static_assert(offsetof(struct pt_iommu_table, iommu.domain) == 0);11991199+ memset_after(fmt_table, 0, iommu.domain);12001200+12011201+ /* The caller can initialize some of these values */12021202+ iommu_table->iommu_device = cfg.iommu_device;12031203+ iommu_table->driver_ops = cfg.driver_ops;12041204+ iommu_table->nid = cfg.nid;12051205+}12061206+12071207+#define pt_iommu_table_cfg CONCATENATE(pt_iommu_table, _cfg)12081208+#define pt_iommu_init CONCATENATE(CONCATENATE(pt_iommu_, PTPFX), init)12091209+12101210+int pt_iommu_init(struct pt_iommu_table *fmt_table,12111211+ const struct pt_iommu_table_cfg *cfg, gfp_t gfp)12121212+{12131213+ struct pt_iommu *iommu_table = &fmt_table->iommu;12141214+ struct pt_common *common = common_from_iommu(iommu_table);12151215+ struct pt_table_p *table_mem;12161216+ int ret;12171217+12181218+ if (cfg->common.hw_max_vasz_lg2 > PT_MAX_VA_ADDRESS_LG2 ||12191219+ !cfg->common.hw_max_vasz_lg2 || !cfg->common.hw_max_oasz_lg2)12201220+ return -EINVAL;12211221+12221222+ pt_iommu_zero(fmt_table);12231223+ common->features = cfg->common.features;12241224+ common->max_vasz_lg2 = cfg->common.hw_max_vasz_lg2;12251225+ common->max_oasz_lg2 = cfg->common.hw_max_oasz_lg2;12261226+ ret = pt_iommu_fmt_init(fmt_table, cfg);12271227+ if (ret)12281228+ return ret;12291229+12301230+ if (cfg->common.hw_max_oasz_lg2 > pt_max_oa_lg2(common))12311231+ return -EINVAL;12321232+12331233+ ret = pt_init_common(common);12341234+ if (ret)12351235+ return ret;12361236+12371237+ if (pt_feature(common, PT_FEAT_DYNAMIC_TOP) &&12381238+ WARN_ON(!iommu_table->driver_ops ||12391239+ !iommu_table->driver_ops->change_top ||12401240+ !iommu_table->driver_ops->get_top_lock))12411241+ return -EINVAL;12421242+12431243+ if (pt_feature(common, PT_FEAT_SIGN_EXTEND) &&12441244+ (pt_feature(common, PT_FEAT_FULL_VA) ||12451245+ pt_feature(common, PT_FEAT_DYNAMIC_TOP)))12461246+ return -EINVAL;12471247+12481248+ if (pt_feature(common, PT_FEAT_DMA_INCOHERENT) &&12491249+ WARN_ON(!iommu_table->iommu_device))12501250+ return -EINVAL;12511251+12521252+ ret = pt_iommu_init_domain(iommu_table, &iommu_table->domain);12531253+ if (ret)12541254+ return ret;12551255+12561256+ table_mem = table_alloc_top(common, common->top_of_table, gfp,12571257+ ALLOC_NORMAL);12581258+ if (IS_ERR(table_mem))12591259+ return PTR_ERR(table_mem);12601260+ pt_top_set(common, table_mem, pt_top_get_level(common));12611261+12621262+ /* Must be last, see pt_iommu_deinit() */12631263+ iommu_table->ops = &NS(ops);12641264+ return 0;12651265+}12661266+EXPORT_SYMBOL_NS_GPL(pt_iommu_init, "GENERIC_PT_IOMMU");12671267+12681268+#ifdef pt_iommu_fmt_hw_info12691269+#define pt_iommu_table_hw_info CONCATENATE(pt_iommu_table, _hw_info)12701270+#define pt_iommu_hw_info CONCATENATE(CONCATENATE(pt_iommu_, PTPFX), hw_info)12711271+void pt_iommu_hw_info(struct pt_iommu_table *fmt_table,12721272+ struct pt_iommu_table_hw_info *info)12731273+{12741274+ struct 
pt_iommu *iommu_table = &fmt_table->iommu;12751275+ struct pt_common *common = common_from_iommu(iommu_table);12761276+ struct pt_range top_range = pt_top_range(common);12771277+12781278+ pt_iommu_fmt_hw_info(fmt_table, &top_range, info);12791279+}12801280+EXPORT_SYMBOL_NS_GPL(pt_iommu_hw_info, "GENERIC_PT_IOMMU");12811281+#endif12821282+12831283+MODULE_LICENSE("GPL");12841284+MODULE_DESCRIPTION("IOMMU Page table implementation for " __stringify(PTPFX_RAW));12851285+MODULE_IMPORT_NS("GENERIC_PT");12861286+/* For iommu_dirty_bitmap_record() */12871287+MODULE_IMPORT_NS("IOMMUFD");12881288+12891289+#endif /* __GENERIC_PT_IOMMU_PT_H */
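A minimal sketch of how a driver is expected to consume the per-format API above, assuming the AMDv1 expansions pt_iommu_amdv1/pt_iommu_amdv1_cfg; the my_domain structure, my_pt_driver_ops and the numeric limits are hypothetical::

    /* Hypothetical driver glue around the generated AMDv1 table */
    struct my_domain {
            struct pt_iommu_amdv1 fmt_table; /* embeds struct pt_iommu and the iommu_domain */
    };

    static int my_domain_init_table(struct my_domain *dom, struct device *iommu_dev)
    {
            struct pt_iommu_amdv1_cfg cfg = {
                    .common = {
                            /* Example limits; a real driver derives these from HW */
                            .hw_max_vasz_lg2 = 48,
                            .hw_max_oasz_lg2 = 52,
                            .features = BIT(PT_FEAT_DYNAMIC_TOP),
                    },
            };

            /* Fields pt_iommu_zero() preserves for the caller */
            dom->fmt_table.iommu.nid = dev_to_node(iommu_dev);
            dom->fmt_table.iommu.iommu_device = iommu_dev;
            /* Provides change_top()/get_top_lock(), required for PT_FEAT_DYNAMIC_TOP */
            dom->fmt_table.iommu.driver_ops = &my_pt_driver_ops;

            return pt_iommu_amdv1_init(&dom->fmt_table, &cfg, GFP_KERNEL);
    }

After a successful init the embedded iommu_domain has its pgsize_bitmap and aperture populated by pt_iommu_init_domain() and the DOMAIN_NS() map/unmap/iova_to_phys entry points can be wired into the driver's iommu_domain_ops.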
+823
drivers/iommu/generic_pt/kunit_generic_pt.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * Test the format API directly.66+ *77+ */88+#include "kunit_iommu.h"99+#include "pt_iter.h"1010+1111+static void do_map(struct kunit *test, pt_vaddr_t va, pt_oaddr_t pa,1212+ pt_vaddr_t len)1313+{1414+ struct kunit_iommu_priv *priv = test->priv;1515+ int ret;1616+1717+ KUNIT_ASSERT_EQ(test, len, (size_t)len);1818+1919+ ret = iommu_map(&priv->domain, va, pa, len, IOMMU_READ | IOMMU_WRITE,2020+ GFP_KERNEL);2121+ KUNIT_ASSERT_NO_ERRNO_FN(test, "map_pages", ret);2222+}2323+2424+#define KUNIT_ASSERT_PT_LOAD(test, pts, entry) \2525+ ({ \2626+ pt_load_entry(pts); \2727+ KUNIT_ASSERT_EQ(test, (pts)->type, entry); \2828+ })2929+3030+struct check_levels_arg {3131+ struct kunit *test;3232+ void *fn_arg;3333+ void (*fn)(struct kunit *test, struct pt_state *pts, void *arg);3434+};3535+3636+static int __check_all_levels(struct pt_range *range, void *arg,3737+ unsigned int level, struct pt_table_p *table)3838+{3939+ struct pt_state pts = pt_init(range, level, table);4040+ struct check_levels_arg *chk = arg;4141+ struct kunit *test = chk->test;4242+ int ret;4343+4444+ _pt_iter_first(&pts);4545+4646+4747+ /*4848+ * If we were able to use the full VA space this should always be the4949+ * last index in each table.5050+ */5151+ if (!(IS_32BIT && range->max_vasz_lg2 > 32)) {5252+ if (pt_feature(range->common, PT_FEAT_SIGN_EXTEND) &&5353+ pts.level == pts.range->top_level)5454+ KUNIT_ASSERT_EQ(test, pts.index,5555+ log2_to_int(range->max_vasz_lg2 - 1 -5656+ pt_table_item_lg2sz(&pts)) -5757+ 1);5858+ else5959+ KUNIT_ASSERT_EQ(test, pts.index,6060+ log2_to_int(pt_table_oa_lg2sz(&pts) -6161+ pt_table_item_lg2sz(&pts)) -6262+ 1);6363+ }6464+6565+ if (pt_can_have_table(&pts)) {6666+ pt_load_single_entry(&pts);6767+ KUNIT_ASSERT_EQ(test, pts.type, PT_ENTRY_TABLE);6868+ ret = pt_descend(&pts, arg, __check_all_levels);6969+ KUNIT_ASSERT_EQ(test, ret, 0);7070+7171+ /* Index 0 is used by the test */7272+ if (IS_32BIT && !pts.index)7373+ return 0;7474+ KUNIT_ASSERT_NE(chk->test, pts.index, 0);7575+ }7676+7777+ /*7878+ * A format should not create a table with only one entry, at least this7979+ * test approach won't work.8080+ */8181+ KUNIT_ASSERT_GT(chk->test, pts.end_index, 1);8282+8383+ /*8484+ * For increase top we end up using index 0 for the original top's tree,8585+ * so use index 1 for testing instead.8686+ */8787+ pts.index = 0;8888+ pt_index_to_va(&pts);8989+ pt_load_single_entry(&pts);9090+ if (pts.type == PT_ENTRY_TABLE && pts.end_index > 2) {9191+ pts.index = 1;9292+ pt_index_to_va(&pts);9393+ }9494+ (*chk->fn)(chk->test, &pts, chk->fn_arg);9595+ return 0;9696+}9797+9898+/*9999+ * Call fn for each level in the table with a pts setup to index 0 in a table100100+ * for that level. 
This allows writing tests that run on every level.101101+ * The test can use every index in the table except the last one.102102+ */103103+static void check_all_levels(struct kunit *test,104104+ void (*fn)(struct kunit *test,105105+ struct pt_state *pts, void *arg),106106+ void *fn_arg)107107+{108108+ struct kunit_iommu_priv *priv = test->priv;109109+ struct pt_range range = pt_top_range(priv->common);110110+ struct check_levels_arg chk = {111111+ .test = test,112112+ .fn = fn,113113+ .fn_arg = fn_arg,114114+ };115115+ int ret;116116+117117+ if (pt_feature(priv->common, PT_FEAT_DYNAMIC_TOP) &&118118+ priv->common->max_vasz_lg2 > range.max_vasz_lg2)119119+ range.last_va = fvalog2_set_mod_max(range.va,120120+ priv->common->max_vasz_lg2);121121+122122+ /*123123+ * Map a page at the highest VA, this will populate all the levels so we124124+ * can then iterate over them. Index 0 will be used for testing.125125+ */126126+ if (IS_32BIT && range.max_vasz_lg2 > 32)127127+ range.last_va = (u32)range.last_va;128128+ range.va = range.last_va - (priv->smallest_pgsz - 1);129129+ do_map(test, range.va, 0, priv->smallest_pgsz);130130+131131+ range = pt_make_range(priv->common, range.va, range.last_va);132132+ ret = pt_walk_range(&range, __check_all_levels, &chk);133133+ KUNIT_ASSERT_EQ(test, ret, 0);134134+}135135+136136+static void test_init(struct kunit *test)137137+{138138+ struct kunit_iommu_priv *priv = test->priv;139139+140140+ /* Fixture does the setup */141141+ KUNIT_ASSERT_NE(test, priv->info.pgsize_bitmap, 0);142142+}143143+144144+/*145145+ * Basic check that the log2_* functions are working, especially at the integer146146+ * limits.147147+ */148148+static void test_bitops(struct kunit *test)149149+{150150+ int i;151151+152152+ KUNIT_ASSERT_EQ(test, fls_t(u32, 0), 0);153153+ KUNIT_ASSERT_EQ(test, fls_t(u32, 1), 1);154154+ KUNIT_ASSERT_EQ(test, fls_t(u32, BIT(2)), 3);155155+ KUNIT_ASSERT_EQ(test, fls_t(u32, U32_MAX), 32);156156+157157+ KUNIT_ASSERT_EQ(test, fls_t(u64, 0), 0);158158+ KUNIT_ASSERT_EQ(test, fls_t(u64, 1), 1);159159+ KUNIT_ASSERT_EQ(test, fls_t(u64, BIT(2)), 3);160160+ KUNIT_ASSERT_EQ(test, fls_t(u64, U64_MAX), 64);161161+162162+ KUNIT_ASSERT_EQ(test, ffs_t(u32, 1), 0);163163+ KUNIT_ASSERT_EQ(test, ffs_t(u32, BIT(2)), 2);164164+ KUNIT_ASSERT_EQ(test, ffs_t(u32, BIT(31)), 31);165165+166166+ KUNIT_ASSERT_EQ(test, ffs_t(u64, 1), 0);167167+ KUNIT_ASSERT_EQ(test, ffs_t(u64, BIT(2)), 2);168168+ KUNIT_ASSERT_EQ(test, ffs_t(u64, BIT_ULL(63)), 63);169169+170170+ for (i = 0; i != 31; i++)171171+ KUNIT_ASSERT_EQ(test, ffz_t(u64, BIT_ULL(i) - 1), i);172172+173173+ for (i = 0; i != 63; i++)174174+ KUNIT_ASSERT_EQ(test, ffz_t(u64, BIT_ULL(i) - 1), i);175175+176176+ for (i = 0; i != 32; i++) {177177+ u64 val = get_random_u64();178178+179179+ KUNIT_ASSERT_EQ(test, log2_mod_t(u32, val, ffs_t(u32, val)), 0);180180+ KUNIT_ASSERT_EQ(test, log2_mod_t(u64, val, ffs_t(u64, val)), 0);181181+182182+ KUNIT_ASSERT_EQ(test, log2_mod_t(u32, val, ffz_t(u32, val)),183183+ log2_to_max_int_t(u32, ffz_t(u32, val)));184184+ KUNIT_ASSERT_EQ(test, log2_mod_t(u64, val, ffz_t(u64, val)),185185+ log2_to_max_int_t(u64, ffz_t(u64, val)));186186+ }187187+}188188+189189+static unsigned int ref_best_pgsize(pt_vaddr_t pgsz_bitmap, pt_vaddr_t va,190190+ pt_vaddr_t last_va, pt_oaddr_t oa)191191+{192192+ pt_vaddr_t pgsz_lg2;193193+194194+ /* Brute force the constraints described in pt_compute_best_pgsize() */195195+ for (pgsz_lg2 = PT_VADDR_MAX_LG2 - 1; pgsz_lg2 != 0; pgsz_lg2--) {196196+ if ((pgsz_bitmap & log2_to_int(pgsz_lg2)) 
&&197197+ log2_mod(va, pgsz_lg2) == 0 &&198198+ oalog2_mod(oa, pgsz_lg2) == 0 &&199199+ va + log2_to_int(pgsz_lg2) - 1 <= last_va &&200200+ log2_div_eq(va, va + log2_to_int(pgsz_lg2) - 1, pgsz_lg2) &&201201+ oalog2_div_eq(oa, oa + log2_to_int(pgsz_lg2) - 1, pgsz_lg2))202202+ return pgsz_lg2;203203+ }204204+ return 0;205205+}206206+207207+/* Check that the bit logic in pt_compute_best_pgsize() works. */208208+static void test_best_pgsize(struct kunit *test)209209+{210210+ unsigned int a_lg2;211211+ unsigned int b_lg2;212212+ unsigned int c_lg2;213213+214214+ /* Try random prefixes with every suffix combination */215215+ for (a_lg2 = 1; a_lg2 != 10; a_lg2++) {216216+ for (b_lg2 = 1; b_lg2 != 10; b_lg2++) {217217+ for (c_lg2 = 1; c_lg2 != 10; c_lg2++) {218218+ pt_vaddr_t pgsz_bitmap = get_random_u64();219219+ pt_vaddr_t va = get_random_u64() << a_lg2;220220+ pt_oaddr_t oa = get_random_u64() << b_lg2;221221+ pt_vaddr_t last_va = log2_set_mod_max(222222+ get_random_u64(), c_lg2);223223+224224+ if (va > last_va)225225+ swap(va, last_va);226226+ KUNIT_ASSERT_EQ(227227+ test,228228+ pt_compute_best_pgsize(pgsz_bitmap, va,229229+ last_va, oa),230230+ ref_best_pgsize(pgsz_bitmap, va,231231+ last_va, oa));232232+ }233233+ }234234+ }235235+236236+ /* 0 prefix, every suffix */237237+ for (c_lg2 = 1; c_lg2 != PT_VADDR_MAX_LG2 - 1; c_lg2++) {238238+ pt_vaddr_t pgsz_bitmap = get_random_u64();239239+ pt_vaddr_t va = 0;240240+ pt_oaddr_t oa = 0;241241+ pt_vaddr_t last_va = log2_set_mod_max(0, c_lg2);242242+243243+ KUNIT_ASSERT_EQ(test,244244+ pt_compute_best_pgsize(pgsz_bitmap, va, last_va,245245+ oa),246246+ ref_best_pgsize(pgsz_bitmap, va, last_va, oa));247247+ }248248+249249+ /* 1's prefix, every suffix */250250+ for (a_lg2 = 1; a_lg2 != 10; a_lg2++) {251251+ for (b_lg2 = 1; b_lg2 != 10; b_lg2++) {252252+ for (c_lg2 = 1; c_lg2 != 10; c_lg2++) {253253+ pt_vaddr_t pgsz_bitmap = get_random_u64();254254+ pt_vaddr_t va = PT_VADDR_MAX << a_lg2;255255+ pt_oaddr_t oa = PT_VADDR_MAX << b_lg2;256256+ pt_vaddr_t last_va = PT_VADDR_MAX;257257+258258+ KUNIT_ASSERT_EQ(259259+ test,260260+ pt_compute_best_pgsize(pgsz_bitmap, va,261261+ last_va, oa),262262+ ref_best_pgsize(pgsz_bitmap, va,263263+ last_va, oa));264264+ }265265+ }266266+ }267267+268268+ /* pgsize_bitmap is always 0 */269269+ for (a_lg2 = 1; a_lg2 != 10; a_lg2++) {270270+ for (b_lg2 = 1; b_lg2 != 10; b_lg2++) {271271+ for (c_lg2 = 1; c_lg2 != 10; c_lg2++) {272272+ pt_vaddr_t pgsz_bitmap = 0;273273+ pt_vaddr_t va = get_random_u64() << a_lg2;274274+ pt_oaddr_t oa = get_random_u64() << b_lg2;275275+ pt_vaddr_t last_va = log2_set_mod_max(276276+ get_random_u64(), c_lg2);277277+278278+ if (va > last_va)279279+ swap(va, last_va);280280+ KUNIT_ASSERT_EQ(281281+ test,282282+ pt_compute_best_pgsize(pgsz_bitmap, va,283283+ last_va, oa),284284+ 0);285285+ }286286+ }287287+ }288288+289289+ if (sizeof(pt_vaddr_t) <= 4)290290+ return;291291+292292+ /* over 32 bit page sizes */293293+ for (a_lg2 = 32; a_lg2 != 42; a_lg2++) {294294+ for (b_lg2 = 32; b_lg2 != 42; b_lg2++) {295295+ for (c_lg2 = 32; c_lg2 != 42; c_lg2++) {296296+ pt_vaddr_t pgsz_bitmap = get_random_u64();297297+ pt_vaddr_t va = get_random_u64() << a_lg2;298298+ pt_oaddr_t oa = get_random_u64() << b_lg2;299299+ pt_vaddr_t last_va = log2_set_mod_max(300300+ get_random_u64(), c_lg2);301301+302302+ if (va > last_va)303303+ swap(va, last_va);304304+ KUNIT_ASSERT_EQ(305305+ test,306306+ pt_compute_best_pgsize(pgsz_bitmap, va,307307+ last_va, oa),308308+ ref_best_pgsize(pgsz_bitmap, va,309309+ last_va, oa));310310+ 
}311311+ }312312+ }313313+}314314+315315+/*316316+ * Check that pt_install_table() and pt_table_pa() match317317+ */318318+static void test_lvl_table_ptr(struct kunit *test, struct pt_state *pts,319319+ void *arg)320320+{321321+ struct kunit_iommu_priv *priv = test->priv;322322+ pt_oaddr_t paddr =323323+ log2_set_mod(priv->test_oa, 0, priv->smallest_pgsz_lg2);324324+ struct pt_write_attrs attrs = {};325325+326326+ if (!pt_can_have_table(pts))327327+ return;328328+329329+ KUNIT_ASSERT_NO_ERRNO_FN(test, "pt_iommu_set_prot",330330+ pt_iommu_set_prot(pts->range->common, &attrs,331331+ IOMMU_READ));332332+333333+ pt_load_single_entry(pts);334334+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_EMPTY);335335+336336+ KUNIT_ASSERT_TRUE(test, pt_install_table(pts, paddr, &attrs));337337+338338+ /* A second install should pass because install updates pts->entry. */339339+ KUNIT_ASSERT_EQ(test, pt_install_table(pts, paddr, &attrs), true);340340+341341+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_TABLE);342342+ KUNIT_ASSERT_EQ(test, pt_table_pa(pts), paddr);343343+344344+ pt_clear_entries(pts, ilog2(1));345345+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_EMPTY);346346+}347347+348348+static void test_table_ptr(struct kunit *test)349349+{350350+ check_all_levels(test, test_lvl_table_ptr, NULL);351351+}352352+353353+struct lvl_radix_arg {354354+ pt_vaddr_t vbits;355355+};356356+357357+/*358358+ * Check pt_table_oa_lg2sz() and pt_table_item_lg2sz() they need to decode a359359+ * continuous list of VA across all the levels that covers the entire advertised360360+ * VA space.361361+ */362362+static void test_lvl_radix(struct kunit *test, struct pt_state *pts, void *arg)363363+{364364+ unsigned int table_lg2sz = pt_table_oa_lg2sz(pts);365365+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);366366+ struct lvl_radix_arg *radix = arg;367367+368368+ /* Every bit below us is decoded */369369+ KUNIT_ASSERT_EQ(test, log2_set_mod_max(0, isz_lg2), radix->vbits);370370+371371+ /* We are not decoding bits someone else is */372372+ KUNIT_ASSERT_EQ(test, log2_div(radix->vbits, isz_lg2), 0);373373+374374+ /* Can't decode past the pt_vaddr_t size */375375+ KUNIT_ASSERT_LE(test, table_lg2sz, PT_VADDR_MAX_LG2);376376+ KUNIT_ASSERT_EQ(test, fvalog2_div(table_lg2sz, PT_MAX_VA_ADDRESS_LG2),377377+ 0);378378+379379+ radix->vbits = fvalog2_set_mod_max(0, table_lg2sz);380380+}381381+382382+static void test_max_va(struct kunit *test)383383+{384384+ struct kunit_iommu_priv *priv = test->priv;385385+ struct pt_range range = pt_top_range(priv->common);386386+387387+ KUNIT_ASSERT_GE(test, priv->common->max_vasz_lg2, range.max_vasz_lg2);388388+}389389+390390+static void test_table_radix(struct kunit *test)391391+{392392+ struct kunit_iommu_priv *priv = test->priv;393393+ struct lvl_radix_arg radix = { .vbits = priv->smallest_pgsz - 1 };394394+ struct pt_range range;395395+396396+ check_all_levels(test, test_lvl_radix, &radix);397397+398398+ range = pt_top_range(priv->common);399399+ if (range.max_vasz_lg2 == PT_VADDR_MAX_LG2) {400400+ KUNIT_ASSERT_EQ(test, radix.vbits, PT_VADDR_MAX);401401+ } else {402402+ if (!IS_32BIT)403403+ KUNIT_ASSERT_EQ(test,404404+ log2_set_mod_max(0, range.max_vasz_lg2),405405+ radix.vbits);406406+ KUNIT_ASSERT_EQ(test, log2_div(radix.vbits, range.max_vasz_lg2),407407+ 0);408408+ }409409+}410410+411411+static unsigned int safe_pt_num_items_lg2(const struct pt_state *pts)412412+{413413+ struct pt_range top_range = pt_top_range(pts->range->common);414414+ struct pt_state top_pts = pt_init_top(&top_range);415415+416416+ 
/*417417+ * Avoid calling pt_num_items_lg2() on the top, instead we can derive418418+ * the size of the top table from the top range.419419+ */420420+ if (pts->level == top_range.top_level)421421+ return ilog2(pt_range_to_end_index(&top_pts));422422+ return pt_num_items_lg2(pts);423423+}424424+425425+static void test_lvl_possible_sizes(struct kunit *test, struct pt_state *pts,426426+ void *arg)427427+{428428+ unsigned int num_items_lg2 = safe_pt_num_items_lg2(pts);429429+ pt_vaddr_t pgsize_bitmap = pt_possible_sizes(pts);430430+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);431431+432432+ if (!pt_can_have_leaf(pts)) {433433+ KUNIT_ASSERT_EQ(test, pgsize_bitmap, 0);434434+ return;435435+ }436436+437437+ /* No bits for sizes that would be outside this table */438438+ KUNIT_ASSERT_EQ(test, log2_mod(pgsize_bitmap, isz_lg2), 0);439439+ KUNIT_ASSERT_EQ(440440+ test, fvalog2_div(pgsize_bitmap, num_items_lg2 + isz_lg2), 0);441441+442442+ /*443443+ * Non contiguous must be supported. AMDv1 has a HW bug where it does444444+ * not support it on one of the levels.445445+ */446446+ if ((u64)pgsize_bitmap != 0xff0000000000ULL ||447447+ strcmp(__stringify(PTPFX_RAW), "amdv1") != 0)448448+ KUNIT_ASSERT_TRUE(test, pgsize_bitmap & log2_to_int(isz_lg2));449449+ else450450+ KUNIT_ASSERT_NE(test, pgsize_bitmap, 0);451451+452452+ /* A contiguous entry should not span the whole table */453453+ if (num_items_lg2 + isz_lg2 != PT_VADDR_MAX_LG2)454454+ KUNIT_ASSERT_FALSE(455455+ test,456456+ pgsize_bitmap & log2_to_int(num_items_lg2 + isz_lg2));457457+}458458+459459+static void test_entry_possible_sizes(struct kunit *test)460460+{461461+ check_all_levels(test, test_lvl_possible_sizes, NULL);462462+}463463+464464+static void sweep_all_pgsizes(struct kunit *test, struct pt_state *pts,465465+ struct pt_write_attrs *attrs,466466+ pt_oaddr_t test_oaddr)467467+{468468+ pt_vaddr_t pgsize_bitmap = pt_possible_sizes(pts);469469+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);470470+ unsigned int len_lg2;471471+472472+ if (pts->index != 0)473473+ return;474474+475475+ for (len_lg2 = 0; len_lg2 < PT_VADDR_MAX_LG2 - 1; len_lg2++) {476476+ struct pt_state sub_pts = *pts;477477+ pt_oaddr_t oaddr;478478+479479+ if (!(pgsize_bitmap & log2_to_int(len_lg2)))480480+ continue;481481+482482+ oaddr = log2_set_mod(test_oaddr, 0, len_lg2);483483+ pt_install_leaf_entry(pts, oaddr, len_lg2, attrs);484484+ /* Verify that every contiguous item translates correctly */485485+ for (sub_pts.index = 0;486486+ sub_pts.index != log2_to_int(len_lg2 - isz_lg2);487487+ sub_pts.index++) {488488+ KUNIT_ASSERT_PT_LOAD(test, &sub_pts, PT_ENTRY_OA);489489+ KUNIT_ASSERT_EQ(test, pt_item_oa(&sub_pts),490490+ oaddr + sub_pts.index *491491+ oalog2_mul(1, isz_lg2));492492+ KUNIT_ASSERT_EQ(test, pt_entry_oa(&sub_pts), oaddr);493493+ KUNIT_ASSERT_EQ(test, pt_entry_num_contig_lg2(&sub_pts),494494+ len_lg2 - isz_lg2);495495+ }496496+497497+ pt_clear_entries(pts, len_lg2 - isz_lg2);498498+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_EMPTY);499499+ }500500+}501501+502502+/*503503+ * Check that pt_install_leaf_entry() and pt_entry_oa() match.504504+ * Check that pt_clear_entries() works.505505+ */506506+static void test_lvl_entry_oa(struct kunit *test, struct pt_state *pts,507507+ void *arg)508508+{509509+ unsigned int max_oa_lg2 = pts->range->common->max_oasz_lg2;510510+ struct kunit_iommu_priv *priv = test->priv;511511+ struct pt_write_attrs attrs = {};512512+513513+ if (!pt_can_have_leaf(pts))514514+ return;515515+516516+ KUNIT_ASSERT_NO_ERRNO_FN(test, 
"pt_iommu_set_prot",517517+ pt_iommu_set_prot(pts->range->common, &attrs,518518+ IOMMU_READ));519519+520520+ sweep_all_pgsizes(test, pts, &attrs, priv->test_oa);521521+522522+ /* Check that the table can store the boundary OAs */523523+ sweep_all_pgsizes(test, pts, &attrs, 0);524524+ if (max_oa_lg2 == PT_OADDR_MAX_LG2)525525+ sweep_all_pgsizes(test, pts, &attrs, PT_OADDR_MAX);526526+ else527527+ sweep_all_pgsizes(test, pts, &attrs,528528+ oalog2_to_max_int(max_oa_lg2));529529+}530530+531531+static void test_entry_oa(struct kunit *test)532532+{533533+ check_all_levels(test, test_lvl_entry_oa, NULL);534534+}535535+536536+/* Test pt_attr_from_entry() */537537+static void test_lvl_attr_from_entry(struct kunit *test, struct pt_state *pts,538538+ void *arg)539539+{540540+ pt_vaddr_t pgsize_bitmap = pt_possible_sizes(pts);541541+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);542542+ struct kunit_iommu_priv *priv = test->priv;543543+ unsigned int len_lg2;544544+ unsigned int prot;545545+546546+ if (!pt_can_have_leaf(pts))547547+ return;548548+549549+ for (len_lg2 = 0; len_lg2 < PT_VADDR_MAX_LG2; len_lg2++) {550550+ if (!(pgsize_bitmap & log2_to_int(len_lg2)))551551+ continue;552552+ for (prot = 0; prot <= (IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE |553553+ IOMMU_NOEXEC | IOMMU_MMIO);554554+ prot++) {555555+ pt_oaddr_t oaddr;556556+ struct pt_write_attrs attrs = {};557557+ u64 good_entry;558558+559559+ /*560560+ * If the format doesn't support this combination of561561+ * prot bits skip it562562+ */563563+ if (pt_iommu_set_prot(pts->range->common, &attrs,564564+ prot)) {565565+ /* But RW has to be supported */566566+ KUNIT_ASSERT_NE(test, prot,567567+ IOMMU_READ | IOMMU_WRITE);568568+ continue;569569+ }570570+571571+ oaddr = log2_set_mod(priv->test_oa, 0, len_lg2);572572+ pt_install_leaf_entry(pts, oaddr, len_lg2, &attrs);573573+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_OA);574574+575575+ good_entry = pts->entry;576576+577577+ memset(&attrs, 0, sizeof(attrs));578578+ pt_attr_from_entry(pts, &attrs);579579+580580+ pt_clear_entries(pts, len_lg2 - isz_lg2);581581+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_EMPTY);582582+583583+ pt_install_leaf_entry(pts, oaddr, len_lg2, &attrs);584584+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_OA);585585+586586+ /*587587+ * The descriptor produced by pt_attr_from_entry()588588+ * produce an identical entry value when re-written589589+ */590590+ KUNIT_ASSERT_EQ(test, good_entry, pts->entry);591591+592592+ pt_clear_entries(pts, len_lg2 - isz_lg2);593593+ }594594+ }595595+}596596+597597+static void test_attr_from_entry(struct kunit *test)598598+{599599+ check_all_levels(test, test_lvl_attr_from_entry, NULL);600600+}601601+602602+static void test_lvl_dirty(struct kunit *test, struct pt_state *pts, void *arg)603603+{604604+ pt_vaddr_t pgsize_bitmap = pt_possible_sizes(pts);605605+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);606606+ struct kunit_iommu_priv *priv = test->priv;607607+ unsigned int start_idx = pts->index;608608+ struct pt_write_attrs attrs = {};609609+ unsigned int len_lg2;610610+611611+ if (!pt_can_have_leaf(pts))612612+ return;613613+614614+ KUNIT_ASSERT_NO_ERRNO_FN(test, "pt_iommu_set_prot",615615+ pt_iommu_set_prot(pts->range->common, &attrs,616616+ IOMMU_READ | IOMMU_WRITE));617617+618618+ for (len_lg2 = 0; len_lg2 < PT_VADDR_MAX_LG2; len_lg2++) {619619+ pt_oaddr_t oaddr;620620+ unsigned int i;621621+622622+ if (!(pgsize_bitmap & log2_to_int(len_lg2)))623623+ continue;624624+625625+ oaddr = log2_set_mod(priv->test_oa, 0, len_lg2);626626+ 
pt_install_leaf_entry(pts, oaddr, len_lg2, &attrs);627627+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_OA);628628+629629+ pt_load_entry(pts);630630+ pt_entry_make_write_clean(pts);631631+ pt_load_entry(pts);632632+ KUNIT_ASSERT_FALSE(test, pt_entry_is_write_dirty(pts));633633+634634+ for (i = 0; i != log2_to_int(len_lg2 - isz_lg2); i++) {635635+ /* dirty every contiguous entry */636636+ pts->index = start_idx + i;637637+ pt_load_entry(pts);638638+ KUNIT_ASSERT_TRUE(test, pt_entry_make_write_dirty(pts));639639+ pts->index = start_idx;640640+ pt_load_entry(pts);641641+ KUNIT_ASSERT_TRUE(test, pt_entry_is_write_dirty(pts));642642+643643+ pt_entry_make_write_clean(pts);644644+ pt_load_entry(pts);645645+ KUNIT_ASSERT_FALSE(test, pt_entry_is_write_dirty(pts));646646+ }647647+648648+ pt_clear_entries(pts, len_lg2 - isz_lg2);649649+ }650650+}651651+652652+static __maybe_unused void test_dirty(struct kunit *test)653653+{654654+ struct kunit_iommu_priv *priv = test->priv;655655+656656+ if (!pt_dirty_supported(priv->common))657657+ kunit_skip(test,658658+ "Page table features do not support dirty tracking");659659+660660+ check_all_levels(test, test_lvl_dirty, NULL);661661+}662662+663663+static void test_lvl_sw_bit_leaf(struct kunit *test, struct pt_state *pts,664664+ void *arg)665665+{666666+ struct kunit_iommu_priv *priv = test->priv;667667+ pt_vaddr_t pgsize_bitmap = pt_possible_sizes(pts);668668+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);669669+ struct pt_write_attrs attrs = {};670670+ unsigned int len_lg2;671671+672672+ if (!pt_can_have_leaf(pts))673673+ return;674674+ if (pts->index != 0)675675+ return;676676+677677+ KUNIT_ASSERT_NO_ERRNO_FN(test, "pt_iommu_set_prot",678678+ pt_iommu_set_prot(pts->range->common, &attrs,679679+ IOMMU_READ));680680+681681+ for (len_lg2 = 0; len_lg2 < PT_VADDR_MAX_LG2 - 1; len_lg2++) {682682+ pt_oaddr_t paddr = log2_set_mod(priv->test_oa, 0, len_lg2);683683+ struct pt_write_attrs new_attrs = {};684684+ unsigned int bitnr;685685+686686+ if (!(pgsize_bitmap & log2_to_int(len_lg2)))687687+ continue;688688+689689+ pt_install_leaf_entry(pts, paddr, len_lg2, &attrs);690690+691691+ for (bitnr = 0; bitnr <= pt_max_sw_bit(pts->range->common);692692+ bitnr++)693693+ KUNIT_ASSERT_FALSE(test,694694+ pt_test_sw_bit_acquire(pts, bitnr));695695+696696+ for (bitnr = 0; bitnr <= pt_max_sw_bit(pts->range->common);697697+ bitnr++) {698698+ KUNIT_ASSERT_FALSE(test,699699+ pt_test_sw_bit_acquire(pts, bitnr));700700+ pt_set_sw_bit_release(pts, bitnr);701701+ KUNIT_ASSERT_TRUE(test,702702+ pt_test_sw_bit_acquire(pts, bitnr));703703+ }704704+705705+ for (bitnr = 0; bitnr <= pt_max_sw_bit(pts->range->common);706706+ bitnr++)707707+ KUNIT_ASSERT_TRUE(test,708708+ pt_test_sw_bit_acquire(pts, bitnr));709709+710710+ KUNIT_ASSERT_EQ(test, pt_item_oa(pts), paddr);711711+712712+ /* SW bits didn't leak into the attrs */713713+ pt_attr_from_entry(pts, &new_attrs);714714+ KUNIT_ASSERT_MEMEQ(test, &new_attrs, &attrs, sizeof(attrs));715715+716716+ pt_clear_entries(pts, len_lg2 - isz_lg2);717717+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_EMPTY);718718+ }719719+}720720+721721+static __maybe_unused void test_sw_bit_leaf(struct kunit *test)722722+{723723+ check_all_levels(test, test_lvl_sw_bit_leaf, NULL);724724+}725725+726726+static void test_lvl_sw_bit_table(struct kunit *test, struct pt_state *pts,727727+ void *arg)728728+{729729+ struct kunit_iommu_priv *priv = test->priv;730730+ struct pt_write_attrs attrs = {};731731+ pt_oaddr_t paddr =732732+ log2_set_mod(priv->test_oa, 0, 
priv->smallest_pgsz_lg2);733733+ unsigned int bitnr;734734+735735+ if (!pt_can_have_leaf(pts))736736+ return;737737+ if (pts->index != 0)738738+ return;739739+740740+ KUNIT_ASSERT_NO_ERRNO_FN(test, "pt_iommu_set_prot",741741+ pt_iommu_set_prot(pts->range->common, &attrs,742742+ IOMMU_READ));743743+744744+ KUNIT_ASSERT_TRUE(test, pt_install_table(pts, paddr, &attrs));745745+746746+ for (bitnr = 0; bitnr <= pt_max_sw_bit(pts->range->common); bitnr++)747747+ KUNIT_ASSERT_FALSE(test, pt_test_sw_bit_acquire(pts, bitnr));748748+749749+ for (bitnr = 0; bitnr <= pt_max_sw_bit(pts->range->common); bitnr++) {750750+ KUNIT_ASSERT_FALSE(test, pt_test_sw_bit_acquire(pts, bitnr));751751+ pt_set_sw_bit_release(pts, bitnr);752752+ KUNIT_ASSERT_TRUE(test, pt_test_sw_bit_acquire(pts, bitnr));753753+ }754754+755755+ for (bitnr = 0; bitnr <= pt_max_sw_bit(pts->range->common); bitnr++)756756+ KUNIT_ASSERT_TRUE(test, pt_test_sw_bit_acquire(pts, bitnr));757757+758758+ KUNIT_ASSERT_EQ(test, pt_table_pa(pts), paddr);759759+760760+ pt_clear_entries(pts, ilog2(1));761761+ KUNIT_ASSERT_PT_LOAD(test, pts, PT_ENTRY_EMPTY);762762+}763763+764764+static __maybe_unused void test_sw_bit_table(struct kunit *test)765765+{766766+ check_all_levels(test, test_lvl_sw_bit_table, NULL);767767+}768768+769769+static struct kunit_case generic_pt_test_cases[] = {770770+ KUNIT_CASE_FMT(test_init),771771+ KUNIT_CASE_FMT(test_bitops),772772+ KUNIT_CASE_FMT(test_best_pgsize),773773+ KUNIT_CASE_FMT(test_table_ptr),774774+ KUNIT_CASE_FMT(test_max_va),775775+ KUNIT_CASE_FMT(test_table_radix),776776+ KUNIT_CASE_FMT(test_entry_possible_sizes),777777+ KUNIT_CASE_FMT(test_entry_oa),778778+ KUNIT_CASE_FMT(test_attr_from_entry),779779+#ifdef pt_entry_is_write_dirty780780+ KUNIT_CASE_FMT(test_dirty),781781+#endif782782+#ifdef pt_sw_bit783783+ KUNIT_CASE_FMT(test_sw_bit_leaf),784784+ KUNIT_CASE_FMT(test_sw_bit_table),785785+#endif786786+ {},787787+};788788+789789+static int pt_kunit_generic_pt_init(struct kunit *test)790790+{791791+ struct kunit_iommu_priv *priv;792792+ int ret;793793+794794+ priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL);795795+ if (!priv)796796+ return -ENOMEM;797797+ ret = pt_kunit_priv_init(test, priv);798798+ if (ret) {799799+ kunit_kfree(test, priv);800800+ return ret;801801+ }802802+ test->priv = priv;803803+ return 0;804804+}805805+806806+static void pt_kunit_generic_pt_exit(struct kunit *test)807807+{808808+ struct kunit_iommu_priv *priv = test->priv;809809+810810+ if (!test->priv)811811+ return;812812+813813+ pt_iommu_deinit(priv->iommu);814814+ kunit_kfree(test, test->priv);815815+}816816+817817+static struct kunit_suite NS(generic_pt_suite) = {818818+ .name = __stringify(NS(fmt_test)),819819+ .init = pt_kunit_generic_pt_init,820820+ .exit = pt_kunit_generic_pt_exit,821821+ .test_cases = generic_pt_test_cases,822822+};823823+kunit_test_suites(&NS(generic_pt_suite));
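For orientation, a sketch of how the NS() mangling in the suite registration above plays out. Assuming a hypothetical format compiled with PTPFX set to amdv1_ (the prefix and the resulting strings are assumptions for illustration, not taken from this patch), the declaration would expand roughly as::

    static struct kunit_suite amdv1_generic_pt_suite = {
            .name = "amdv1_fmt_test",
            .init = pt_kunit_generic_pt_init,
            .exit = pt_kunit_generic_pt_exit,
            .test_cases = generic_pt_test_cases,
    };
    kunit_test_suites(&amdv1_generic_pt_suite);

Note also that test_dirty and the test_sw_bit_* cases only appear in generic_pt_test_cases when the format being compiled defines pt_entry_is_write_dirty or pt_sw_bit, per the #ifdef guards in the case table.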
+184
drivers/iommu/generic_pt/kunit_iommu.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ */55+#ifndef __GENERIC_PT_KUNIT_IOMMU_H66+#define __GENERIC_PT_KUNIT_IOMMU_H77+88+#define GENERIC_PT_KUNIT 199+#include <kunit/device.h>1010+#include <kunit/test.h>1111+#include "../iommu-pages.h"1212+#include "pt_iter.h"1313+1414+#define pt_iommu_table_cfg CONCATENATE(pt_iommu_table, _cfg)1515+#define pt_iommu_init CONCATENATE(CONCATENATE(pt_iommu_, PTPFX), init)1616+int pt_iommu_init(struct pt_iommu_table *fmt_table,1717+ const struct pt_iommu_table_cfg *cfg, gfp_t gfp);1818+1919+/* The format can provide a list of configurations it would like to test */2020+#ifdef kunit_fmt_cfgs2121+static const void *kunit_pt_gen_params_cfg(struct kunit *test, const void *prev,2222+ char *desc)2323+{2424+ uintptr_t cfg_id = (uintptr_t)prev;2525+2626+ cfg_id++;2727+ if (cfg_id >= ARRAY_SIZE(kunit_fmt_cfgs) + 1)2828+ return NULL;2929+ snprintf(desc, KUNIT_PARAM_DESC_SIZE, "%s_cfg_%u",3030+ __stringify(PTPFX_RAW), (unsigned int)(cfg_id - 1));3131+ return (void *)cfg_id;3232+}3333+#define KUNIT_CASE_FMT(test_name) \3434+ KUNIT_CASE_PARAM(test_name, kunit_pt_gen_params_cfg)3535+#else3636+#define KUNIT_CASE_FMT(test_name) KUNIT_CASE(test_name)3737+#endif3838+3939+#define KUNIT_ASSERT_NO_ERRNO(test, ret) \4040+ KUNIT_ASSERT_EQ_MSG(test, ret, 0, KUNIT_SUBSUBTEST_INDENT "errno %pe", \4141+ ERR_PTR(ret))4242+4343+#define KUNIT_ASSERT_NO_ERRNO_FN(test, fn, ret) \4444+ KUNIT_ASSERT_EQ_MSG(test, ret, 0, \4545+ KUNIT_SUBSUBTEST_INDENT "errno %pe from %s", \4646+ ERR_PTR(ret), fn)4747+4848+/*4949+ * When the test is run on a 32 bit system unsigned long can be 32 bits. This5050+ * cause the iommu op signatures to be restricted to 32 bits. Meaning the test5151+ * has to be mindful not to create any VA's over the 32 bit limit. Reduce the5252+ * scope of the testing as the main purpose of checking on full 32 bit is to5353+ * look for 32bitism in the core code. 
Run the test on i386 with X86_PAE=y to5454+ * get the full coverage when dma_addr_t & phys_addr_t are 8 bytes5555+ */5656+#define IS_32BIT (sizeof(unsigned long) == 4)5757+5858+struct kunit_iommu_priv {5959+ union {6060+ struct iommu_domain domain;6161+ struct pt_iommu_table fmt_table;6262+ };6363+ spinlock_t top_lock;6464+ struct device *dummy_dev;6565+ struct pt_iommu *iommu;6666+ struct pt_common *common;6767+ struct pt_iommu_table_cfg cfg;6868+ struct pt_iommu_info info;6969+ unsigned int smallest_pgsz_lg2;7070+ pt_vaddr_t smallest_pgsz;7171+ unsigned int largest_pgsz_lg2;7272+ pt_oaddr_t test_oa;7373+ pt_vaddr_t safe_pgsize_bitmap;7474+ unsigned long orig_nr_secondary_pagetable;7575+7676+};7777+PT_IOMMU_CHECK_DOMAIN(struct kunit_iommu_priv, fmt_table.iommu, domain);7878+7979+static void pt_kunit_iotlb_sync(struct iommu_domain *domain,8080+ struct iommu_iotlb_gather *gather)8181+{8282+ iommu_put_pages_list(&gather->freelist);8383+}8484+8585+#define IOMMU_PT_DOMAIN_OPS1(x) IOMMU_PT_DOMAIN_OPS(x)8686+static const struct iommu_domain_ops kunit_pt_ops = {8787+ IOMMU_PT_DOMAIN_OPS1(PTPFX_RAW),8888+ .iotlb_sync = &pt_kunit_iotlb_sync,8989+};9090+9191+static void pt_kunit_change_top(struct pt_iommu *iommu_table,9292+ phys_addr_t top_paddr, unsigned int top_level)9393+{9494+}9595+9696+static spinlock_t *pt_kunit_get_top_lock(struct pt_iommu *iommu_table)9797+{9898+ struct kunit_iommu_priv *priv = container_of(9999+ iommu_table, struct kunit_iommu_priv, fmt_table.iommu);100100+101101+ return &priv->top_lock;102102+}103103+104104+static const struct pt_iommu_driver_ops pt_kunit_driver_ops = {105105+ .change_top = &pt_kunit_change_top,106106+ .get_top_lock = &pt_kunit_get_top_lock,107107+};108108+109109+static int pt_kunit_priv_init(struct kunit *test, struct kunit_iommu_priv *priv)110110+{111111+ unsigned int va_lg2sz;112112+ int ret;113113+114114+ /* Enough so the memory allocator works */115115+ priv->dummy_dev = kunit_device_register(test, "pt_kunit_dev");116116+ if (IS_ERR(priv->dummy_dev))117117+ return PTR_ERR(priv->dummy_dev);118118+ set_dev_node(priv->dummy_dev, NUMA_NO_NODE);119119+120120+ spin_lock_init(&priv->top_lock);121121+122122+#ifdef kunit_fmt_cfgs123123+ priv->cfg = kunit_fmt_cfgs[((uintptr_t)test->param_value) - 1];124124+ /*125125+ * The format can set a list of features that the kunit_fmt_cfgs126126+ * controls, other features are default to on.127127+ */128128+ priv->cfg.common.features |= PT_SUPPORTED_FEATURES &129129+ (~KUNIT_FMT_FEATURES);130130+#else131131+ priv->cfg.common.features = PT_SUPPORTED_FEATURES;132132+#endif133133+134134+ /* Defaults, for the kunit */135135+ if (!priv->cfg.common.hw_max_vasz_lg2)136136+ priv->cfg.common.hw_max_vasz_lg2 = PT_MAX_VA_ADDRESS_LG2;137137+ if (!priv->cfg.common.hw_max_oasz_lg2)138138+ priv->cfg.common.hw_max_oasz_lg2 = pt_max_oa_lg2(NULL);139139+140140+ priv->fmt_table.iommu.nid = NUMA_NO_NODE;141141+ priv->fmt_table.iommu.driver_ops = &pt_kunit_driver_ops;142142+ priv->fmt_table.iommu.iommu_device = priv->dummy_dev;143143+ priv->domain.ops = &kunit_pt_ops;144144+ ret = pt_iommu_init(&priv->fmt_table, &priv->cfg, GFP_KERNEL);145145+ if (ret) {146146+ if (ret == -EOVERFLOW)147147+ kunit_skip(148148+ test,149149+ "This configuration cannot be tested on 32 bit");150150+ return ret;151151+ }152152+153153+ priv->iommu = &priv->fmt_table.iommu;154154+ priv->common = common_from_iommu(&priv->fmt_table.iommu);155155+ priv->iommu->ops->get_info(priv->iommu, &priv->info);156156+157157+ /*158158+ * size_t is used to pass the mapping length, 
it can be 32 bit, truncate159159+ * the pagesizes so we don't use large sizes.160160+ */161161+ priv->info.pgsize_bitmap = (size_t)priv->info.pgsize_bitmap;162162+163163+ priv->smallest_pgsz_lg2 = vaffs(priv->info.pgsize_bitmap);164164+ priv->smallest_pgsz = log2_to_int(priv->smallest_pgsz_lg2);165165+ priv->largest_pgsz_lg2 =166166+ vafls((dma_addr_t)priv->info.pgsize_bitmap) - 1;167167+168168+ priv->test_oa =169169+ oalog2_mod(0x74a71445deadbeef, priv->common->max_oasz_lg2);170170+171171+ /*172172+ * We run out of VA space if the mappings get too big, make something173173+ * smaller that can safely pass through dma_addr_t API.174174+ */175175+ va_lg2sz = priv->common->max_vasz_lg2;176176+ if (IS_32BIT && va_lg2sz > 32)177177+ va_lg2sz = 32;178178+ priv->safe_pgsize_bitmap =179179+ log2_mod(priv->info.pgsize_bitmap, va_lg2sz - 1);180180+181181+ return 0;182182+}183183+184184+#endif
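As a hedged sketch of how a format's kunit compilation unit might opt into the parameterized runs driven by kunit_pt_gen_params_cfg() above: it would define a kunit_fmt_cfgs array (and KUNIT_FMT_FEATURES) before including kunit_iommu.h. Only the struct type and the hw_max_vasz_lg2/hw_max_oasz_lg2 field names come from pt_kunit_priv_init() above; the concrete values and the self-referential #define are assumptions for illustration::

    /* Hypothetical per-format test configurations; values are examples only */
    static const struct pt_iommu_table_cfg kunit_fmt_cfgs[] = {
            { .common = { .hw_max_vasz_lg2 = 39 } },
            { .common = { .hw_max_vasz_lg2 = 48 } },
    };
    #define kunit_fmt_cfgs kunit_fmt_cfgs
    /* Features controlled by the cfgs above; everything else defaults to on */
    #define KUNIT_FMT_FEATURES 0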
+487
drivers/iommu/generic_pt/kunit_iommu_pt.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES44+ */55+#include "kunit_iommu.h"66+#include "pt_iter.h"77+#include <linux/generic_pt/iommu.h>88+#include <linux/iommu.h>99+1010+static void do_map(struct kunit *test, pt_vaddr_t va, pt_oaddr_t pa,1111+ pt_vaddr_t len);1212+1313+struct count_valids {1414+ u64 per_size[PT_VADDR_MAX_LG2];1515+};1616+1717+static int __count_valids(struct pt_range *range, void *arg, unsigned int level,1818+ struct pt_table_p *table)1919+{2020+ struct pt_state pts = pt_init(range, level, table);2121+ struct count_valids *valids = arg;2222+2323+ for_each_pt_level_entry(&pts) {2424+ if (pts.type == PT_ENTRY_TABLE) {2525+ pt_descend(&pts, arg, __count_valids);2626+ continue;2727+ }2828+ if (pts.type == PT_ENTRY_OA) {2929+ valids->per_size[pt_entry_oa_lg2sz(&pts)]++;3030+ continue;3131+ }3232+ }3333+ return 0;3434+}3535+3636+/*3737+ * Number of valid table entries. This counts contiguous entries as a single3838+ * valid.3939+ */4040+static unsigned int count_valids(struct kunit *test)4141+{4242+ struct kunit_iommu_priv *priv = test->priv;4343+ struct pt_range range = pt_top_range(priv->common);4444+ struct count_valids valids = {};4545+ u64 total = 0;4646+ unsigned int i;4747+4848+ KUNIT_ASSERT_NO_ERRNO(test,4949+ pt_walk_range(&range, __count_valids, &valids));5050+5151+ for (i = 0; i != ARRAY_SIZE(valids.per_size); i++)5252+ total += valids.per_size[i];5353+ return total;5454+}5555+5656+/* Only a single page size is present, count the number of valid entries */5757+static unsigned int count_valids_single(struct kunit *test, pt_vaddr_t pgsz)5858+{5959+ struct kunit_iommu_priv *priv = test->priv;6060+ struct pt_range range = pt_top_range(priv->common);6161+ struct count_valids valids = {};6262+ u64 total = 0;6363+ unsigned int i;6464+6565+ KUNIT_ASSERT_NO_ERRNO(test,6666+ pt_walk_range(&range, __count_valids, &valids));6767+6868+ for (i = 0; i != ARRAY_SIZE(valids.per_size); i++) {6969+ if ((1ULL << i) == pgsz)7070+ total = valids.per_size[i];7171+ else7272+ KUNIT_ASSERT_EQ(test, valids.per_size[i], 0);7373+ }7474+ return total;7575+}7676+7777+static void do_unmap(struct kunit *test, pt_vaddr_t va, pt_vaddr_t len)7878+{7979+ struct kunit_iommu_priv *priv = test->priv;8080+ size_t ret;8181+8282+ ret = iommu_unmap(&priv->domain, va, len);8383+ KUNIT_ASSERT_EQ(test, ret, len);8484+}8585+8686+static void check_iova(struct kunit *test, pt_vaddr_t va, pt_oaddr_t pa,8787+ pt_vaddr_t len)8888+{8989+ struct kunit_iommu_priv *priv = test->priv;9090+ pt_vaddr_t pfn = log2_div(va, priv->smallest_pgsz_lg2);9191+ pt_vaddr_t end_pfn = pfn + log2_div(len, priv->smallest_pgsz_lg2);9292+9393+ for (; pfn != end_pfn; pfn++) {9494+ phys_addr_t res = iommu_iova_to_phys(&priv->domain,9595+ pfn * priv->smallest_pgsz);9696+9797+ KUNIT_ASSERT_EQ(test, res, (phys_addr_t)pa);9898+ if (res != pa)9999+ break;100100+ pa += priv->smallest_pgsz;101101+ }102102+}103103+104104+static void test_increase_level(struct kunit *test)105105+{106106+ struct kunit_iommu_priv *priv = test->priv;107107+ struct pt_common *common = priv->common;108108+109109+ if (!pt_feature(common, PT_FEAT_DYNAMIC_TOP))110110+ kunit_skip(test, "PT_FEAT_DYNAMIC_TOP not set for this format");111111+112112+ if (IS_32BIT)113113+ kunit_skip(test, "Unable to test on 32bit");114114+115115+ KUNIT_ASSERT_GT(test, common->max_vasz_lg2,116116+ pt_top_range(common).max_vasz_lg2);117117+118118+ /* Add every possible level to the max */119119+ while (common->max_vasz_lg2 != 
pt_top_range(common).max_vasz_lg2) {120120+ struct pt_range top_range = pt_top_range(common);121121+122122+ if (top_range.va == 0)123123+ do_map(test, top_range.last_va + 1, 0,124124+ priv->smallest_pgsz);125125+ else126126+ do_map(test, top_range.va - priv->smallest_pgsz, 0,127127+ priv->smallest_pgsz);128128+129129+ KUNIT_ASSERT_EQ(test, pt_top_range(common).top_level,130130+ top_range.top_level + 1);131131+ KUNIT_ASSERT_GE(test, common->max_vasz_lg2,132132+ pt_top_range(common).max_vasz_lg2);133133+ }134134+}135135+136136+static void test_map_simple(struct kunit *test)137137+{138138+ struct kunit_iommu_priv *priv = test->priv;139139+ struct pt_range range = pt_top_range(priv->common);140140+ struct count_valids valids = {};141141+ pt_vaddr_t pgsize_bitmap = priv->safe_pgsize_bitmap;142142+ unsigned int pgsz_lg2;143143+ pt_vaddr_t cur_va;144144+145145+ /* Map every reported page size */146146+ cur_va = range.va + priv->smallest_pgsz * 256;147147+ for (pgsz_lg2 = 0; pgsz_lg2 != PT_VADDR_MAX_LG2; pgsz_lg2++) {148148+ pt_oaddr_t paddr = log2_set_mod(priv->test_oa, 0, pgsz_lg2);149149+ u64 len = log2_to_int(pgsz_lg2);150150+151151+ if (!(pgsize_bitmap & len))152152+ continue;153153+154154+ cur_va = ALIGN(cur_va, len);155155+ do_map(test, cur_va, paddr, len);156156+ if (len <= SZ_2G)157157+ check_iova(test, cur_va, paddr, len);158158+ cur_va += len;159159+ }160160+161161+ /* The read interface reports that every page size was created */162162+ range = pt_top_range(priv->common);163163+ KUNIT_ASSERT_NO_ERRNO(test,164164+ pt_walk_range(&range, __count_valids, &valids));165165+ for (pgsz_lg2 = 0; pgsz_lg2 != PT_VADDR_MAX_LG2; pgsz_lg2++) {166166+ if (pgsize_bitmap & (1ULL << pgsz_lg2))167167+ KUNIT_ASSERT_EQ(test, valids.per_size[pgsz_lg2], 1);168168+ else169169+ KUNIT_ASSERT_EQ(test, valids.per_size[pgsz_lg2], 0);170170+ }171171+172172+ /* Unmap works */173173+ range = pt_top_range(priv->common);174174+ cur_va = range.va + priv->smallest_pgsz * 256;175175+ for (pgsz_lg2 = 0; pgsz_lg2 != PT_VADDR_MAX_LG2; pgsz_lg2++) {176176+ u64 len = log2_to_int(pgsz_lg2);177177+178178+ if (!(pgsize_bitmap & len))179179+ continue;180180+ cur_va = ALIGN(cur_va, len);181181+ do_unmap(test, cur_va, len);182182+ cur_va += len;183183+ }184184+ KUNIT_ASSERT_EQ(test, count_valids(test), 0);185185+}186186+187187+/*188188+ * Test to convert a table pointer into an OA by mapping something small,189189+ * unmapping it so as to leave behind a table pointer, then mapping something190190+ * larger that will convert the table into an OA.191191+ */192192+static void test_map_table_to_oa(struct kunit *test)193193+{194194+ struct kunit_iommu_priv *priv = test->priv;195195+ pt_vaddr_t limited_pgbitmap =196196+ priv->info.pgsize_bitmap % (IS_32BIT ? 
SZ_2G : SZ_16G);197197+ struct pt_range range = pt_top_range(priv->common);198198+ unsigned int pgsz_lg2;199199+ pt_vaddr_t max_pgsize;200200+ pt_vaddr_t cur_va;201201+202202+ max_pgsize = 1ULL << (vafls(limited_pgbitmap) - 1);203203+ KUNIT_ASSERT_TRUE(test, priv->info.pgsize_bitmap & max_pgsize);204204+205205+ for (pgsz_lg2 = 0; pgsz_lg2 != PT_VADDR_MAX_LG2; pgsz_lg2++) {206206+ pt_oaddr_t paddr = log2_set_mod(priv->test_oa, 0, pgsz_lg2);207207+ u64 len = log2_to_int(pgsz_lg2);208208+ pt_vaddr_t offset;209209+210210+ if (!(priv->info.pgsize_bitmap & len))211211+ continue;212212+ if (len > max_pgsize)213213+ break;214214+215215+ cur_va = ALIGN(range.va + priv->smallest_pgsz * 256,216216+ max_pgsize);217217+ for (offset = 0; offset != max_pgsize; offset += len)218218+ do_map(test, cur_va + offset, paddr + offset, len);219219+ check_iova(test, cur_va, paddr, max_pgsize);220220+ KUNIT_ASSERT_EQ(test, count_valids_single(test, len),221221+ log2_div(max_pgsize, pgsz_lg2));222222+223223+ if (len == max_pgsize) {224224+ do_unmap(test, cur_va, max_pgsize);225225+ } else {226226+ do_unmap(test, cur_va, max_pgsize / 2);227227+ for (offset = max_pgsize / 2; offset != max_pgsize;228228+ offset += len)229229+ do_unmap(test, cur_va + offset, len);230230+ }231231+232232+ KUNIT_ASSERT_EQ(test, count_valids(test), 0);233233+ }234234+}235235+236236+/*237237+ * Test unmapping a small page at the start of a large page. This always unmaps238238+ * the large page.239239+ */240240+static void test_unmap_split(struct kunit *test)241241+{242242+ struct kunit_iommu_priv *priv = test->priv;243243+ struct pt_range top_range = pt_top_range(priv->common);244244+ pt_vaddr_t pgsize_bitmap = priv->safe_pgsize_bitmap;245245+ unsigned int pgsz_lg2;246246+ unsigned int count = 0;247247+248248+ for (pgsz_lg2 = 0; pgsz_lg2 != PT_VADDR_MAX_LG2; pgsz_lg2++) {249249+ pt_vaddr_t base_len = log2_to_int(pgsz_lg2);250250+ unsigned int next_pgsz_lg2;251251+252252+ if (!(pgsize_bitmap & base_len))253253+ continue;254254+255255+ for (next_pgsz_lg2 = pgsz_lg2 + 1;256256+ next_pgsz_lg2 != PT_VADDR_MAX_LG2; next_pgsz_lg2++) {257257+ pt_vaddr_t next_len = log2_to_int(next_pgsz_lg2);258258+ pt_vaddr_t vaddr = top_range.va;259259+ pt_oaddr_t paddr = 0;260260+ size_t gnmapped;261261+262262+ if (!(pgsize_bitmap & next_len))263263+ continue;264264+265265+ do_map(test, vaddr, paddr, next_len);266266+ gnmapped = iommu_unmap(&priv->domain, vaddr, base_len);267267+ KUNIT_ASSERT_EQ(test, gnmapped, next_len);268268+269269+ /* Make sure unmap doesn't keep going */270270+ do_map(test, vaddr, paddr, next_len);271271+ do_map(test, vaddr + next_len, paddr, next_len);272272+ gnmapped = iommu_unmap(&priv->domain, vaddr, base_len);273273+ KUNIT_ASSERT_EQ(test, gnmapped, next_len);274274+ gnmapped = iommu_unmap(&priv->domain, vaddr + next_len,275275+ next_len);276276+ KUNIT_ASSERT_EQ(test, gnmapped, next_len);277277+278278+ count++;279279+ }280280+ }281281+282282+ if (count == 0)283283+ kunit_skip(test, "Test needs two page sizes");284284+}285285+286286+static void unmap_collisions(struct kunit *test, struct maple_tree *mt,287287+ pt_vaddr_t start, pt_vaddr_t last)288288+{289289+ struct kunit_iommu_priv *priv = test->priv;290290+ MA_STATE(mas, mt, start, last);291291+ void *entry;292292+293293+ mtree_lock(mt);294294+ mas_for_each(&mas, entry, last) {295295+ pt_vaddr_t mas_start = mas.index;296296+ pt_vaddr_t len = (mas.last - mas_start) + 1;297297+ pt_oaddr_t paddr;298298+299299+ mas_erase(&mas);300300+ mas_pause(&mas);301301+ mtree_unlock(mt);302302+303303+ 
paddr = oalog2_mod(mas_start, priv->common->max_oasz_lg2);304304+ check_iova(test, mas_start, paddr, len);305305+ do_unmap(test, mas_start, len);306306+ mtree_lock(mt);307307+ }308308+ mtree_unlock(mt);309309+}310310+311311+static void clamp_range(struct kunit *test, struct pt_range *range)312312+{313313+ struct kunit_iommu_priv *priv = test->priv;314314+315315+ if (range->last_va - range->va > SZ_1G)316316+ range->last_va = range->va + SZ_1G;317317+ KUNIT_ASSERT_NE(test, range->last_va, PT_VADDR_MAX);318318+ if (range->va <= MAPLE_RESERVED_RANGE)319319+ range->va =320320+ ALIGN(MAPLE_RESERVED_RANGE, priv->smallest_pgsz);321321+}322322+323323+/*324324+ * Randomly map and unmap ranges that can large physical pages. If a random325325+ * range overlaps with existing ranges then unmap them. This hits all the326326+ * special cases.327327+ */328328+static void test_random_map(struct kunit *test)329329+{330330+ struct kunit_iommu_priv *priv = test->priv;331331+ struct pt_range upper_range = pt_upper_range(priv->common);332332+ struct pt_range top_range = pt_top_range(priv->common);333333+ struct maple_tree mt;334334+ unsigned int iter;335335+336336+ mt_init(&mt);337337+338338+ /*339339+ * Shrink the range so randomization is more likely to have340340+ * intersections341341+ */342342+ clamp_range(test, &top_range);343343+ clamp_range(test, &upper_range);344344+345345+ for (iter = 0; iter != 1000; iter++) {346346+ struct pt_range *range = &top_range;347347+ pt_oaddr_t paddr;348348+ pt_vaddr_t start;349349+ pt_vaddr_t end;350350+ int ret;351351+352352+ if (pt_feature(priv->common, PT_FEAT_SIGN_EXTEND) &&353353+ ULONG_MAX >= PT_VADDR_MAX && get_random_u32_inclusive(0, 1))354354+ range = &upper_range;355355+356356+ start = get_random_u32_below(357357+ min(U32_MAX, range->last_va - range->va));358358+ end = get_random_u32_below(359359+ min(U32_MAX, range->last_va - start));360360+361361+ start = ALIGN_DOWN(start, priv->smallest_pgsz);362362+ end = ALIGN(end, priv->smallest_pgsz);363363+ start += range->va;364364+ end += start;365365+ if (start < range->va || end > range->last_va + 1 ||366366+ start >= end)367367+ continue;368368+369369+ /* Try overmapping to test the failure handling */370370+ paddr = oalog2_mod(start, priv->common->max_oasz_lg2);371371+ ret = iommu_map(&priv->domain, start, paddr, end - start,372372+ IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);373373+ if (ret) {374374+ KUNIT_ASSERT_EQ(test, ret, -EADDRINUSE);375375+ unmap_collisions(test, &mt, start, end - 1);376376+ do_map(test, start, paddr, end - start);377377+ }378378+379379+ KUNIT_ASSERT_NO_ERRNO_FN(test, "mtree_insert_range",380380+ mtree_insert_range(&mt, start, end - 1,381381+ XA_ZERO_ENTRY,382382+ GFP_KERNEL));383383+384384+ check_iova(test, start, paddr, end - start);385385+ if (iter % 100)386386+ cond_resched();387387+ }388388+389389+ unmap_collisions(test, &mt, 0, PT_VADDR_MAX);390390+ KUNIT_ASSERT_EQ(test, count_valids(test), 0);391391+392392+ mtree_destroy(&mt);393393+}394394+395395+/* See https://lore.kernel.org/r/b9b18a03-63a2-4065-a27e-d92dd5c860bc@amd.com */396396+static void test_pgsize_boundary(struct kunit *test)397397+{398398+ struct kunit_iommu_priv *priv = test->priv;399399+ struct pt_range top_range = pt_top_range(priv->common);400400+401401+ if (top_range.va != 0 || top_range.last_va < 0xfef9ffff ||402402+ priv->smallest_pgsz != SZ_4K)403403+ kunit_skip(test, "Format does not have the required range");404404+405405+ do_map(test, 0xfef80000, 0x208b95d000, 0xfef9ffff - 0xfef80000 + 1);406406+}407407+408408+/* See 
https://lore.kernel.org/r/20250826143816.38686-1-eugkoira@amazon.com */409409+static void test_mixed(struct kunit *test)410410+{411411+ struct kunit_iommu_priv *priv = test->priv;412412+ struct pt_range top_range = pt_top_range(priv->common);413413+ u64 start = 0x3fe400ULL << 12;414414+ u64 end = 0x4c0600ULL << 12;415415+ pt_vaddr_t len = end - start;416416+ pt_oaddr_t oa = start;417417+418418+ if (top_range.last_va <= start || sizeof(unsigned long) == 4)419419+ kunit_skip(test, "range is too small");420420+ if ((priv->safe_pgsize_bitmap & GENMASK(30, 21)) != (BIT(30) | BIT(21)))421421+ kunit_skip(test, "incompatible psize");422422+423423+ do_map(test, start, oa, len);424424+ /* 14 2M, 3 1G, 3 2M */425425+ KUNIT_ASSERT_EQ(test, count_valids(test), 20);426426+ check_iova(test, start, oa, len);427427+}428428+429429+static struct kunit_case iommu_test_cases[] = {430430+ KUNIT_CASE_FMT(test_increase_level),431431+ KUNIT_CASE_FMT(test_map_simple),432432+ KUNIT_CASE_FMT(test_map_table_to_oa),433433+ KUNIT_CASE_FMT(test_unmap_split),434434+ KUNIT_CASE_FMT(test_random_map),435435+ KUNIT_CASE_FMT(test_pgsize_boundary),436436+ KUNIT_CASE_FMT(test_mixed),437437+ {},438438+};439439+440440+static int pt_kunit_iommu_init(struct kunit *test)441441+{442442+ struct kunit_iommu_priv *priv;443443+ int ret;444444+445445+ priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL);446446+ if (!priv)447447+ return -ENOMEM;448448+449449+ priv->orig_nr_secondary_pagetable =450450+ global_node_page_state(NR_SECONDARY_PAGETABLE);451451+ ret = pt_kunit_priv_init(test, priv);452452+ if (ret) {453453+ kunit_kfree(test, priv);454454+ return ret;455455+ }456456+ test->priv = priv;457457+ return 0;458458+}459459+460460+static void pt_kunit_iommu_exit(struct kunit *test)461461+{462462+ struct kunit_iommu_priv *priv = test->priv;463463+464464+ if (!test->priv)465465+ return;466466+467467+ pt_iommu_deinit(priv->iommu);468468+ /*469469+ * Look for memory leaks, assumes kunit is running isolated and nothing470470+ * else is using secondary page tables.471471+ */472472+ KUNIT_ASSERT_EQ(test, priv->orig_nr_secondary_pagetable,473473+ global_node_page_state(NR_SECONDARY_PAGETABLE));474474+ kunit_kfree(test, test->priv);475475+}476476+477477+static struct kunit_suite NS(iommu_suite) = {478478+ .name = __stringify(NS(iommu_test)),479479+ .init = pt_kunit_iommu_init,480480+ .exit = pt_kunit_iommu_exit,481481+ .test_cases = iommu_test_cases,482482+};483483+kunit_test_suites(&NS(iommu_suite));484484+485485+MODULE_LICENSE("GPL");486486+MODULE_DESCRIPTION("Kunit for generic page table");487487+MODULE_IMPORT_NS("GENERIC_PT_IOMMU");
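The expected count of 20 in test_mixed() follows from the "14 2M, 3 1G, 3 2M" split noted in that test. A standalone arithmetic check (not part of the patch, assuming the mapper fills the unaligned head and tail with 2M entries exactly as the in-test comment says) is::

    #include <stdio.h>

    int main(void)
    {
            unsigned long long start = 0x3fe400ULL << 12, end = 0x4c0600ULL << 12;
            unsigned long long sz_2m = 1ULL << 21, sz_1g = 1ULL << 30;
            unsigned long long head, giga, tail;

            head = ((start + sz_1g - 1) & ~(sz_1g - 1)) - start; /* 2M entries up to 1G alignment */
            giga = ((end - start - head) / sz_1g) * sz_1g;       /* whole 1G entries */
            tail = end - start - head - giga;                    /* trailing 2M entries */

            printf("%llu + %llu + %llu = %llu entries\n",
                   head / sz_2m, giga / sz_1g, tail / sz_2m,
                   head / sz_2m + giga / sz_1g + tail / sz_2m);
            return 0;
    }

This prints "14 + 3 + 3 = 20 entries", matching the count_valids() assertion.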
+389
drivers/iommu/generic_pt/pt_common.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * This header is included after the format. It contains definitions66+ * that build on the format definitions to create the basic format API.77+ *88+ * The format API is listed here, with kdocs. The functions without bodies are99+ * implemented in the format using the pattern:1010+ * static inline FMTpt_XXX(..) {..}1111+ * #define pt_XXX FMTpt_XXX1212+ *1313+ * If the format doesn't implement a function then pt_fmt_defaults.h can provide1414+ * a generic version.1515+ *1616+ * The routines marked "@pts: Entry to query" operate on the entire contiguous1717+ * entry and can be called with a pts->index pointing to any sub item that makes1818+ * up that entry.1919+ *2020+ * The header order is:2121+ * pt_defs.h2222+ * FMT.h2323+ * pt_common.h2424+ */2525+#ifndef __GENERIC_PT_PT_COMMON_H2626+#define __GENERIC_PT_PT_COMMON_H2727+2828+#include "pt_defs.h"2929+#include "pt_fmt_defaults.h"3030+3131+/**3232+ * pt_attr_from_entry() - Convert the permission bits back to attrs3333+ * @pts: Entry to convert from3434+ * @attrs: Resulting attrs3535+ *3636+ * Fill in the attrs with the permission bits encoded in the current leaf entry.3737+ * The attrs should be usable with pt_install_leaf_entry() to reconstruct the3838+ * same entry.3939+ */4040+static inline void pt_attr_from_entry(const struct pt_state *pts,4141+ struct pt_write_attrs *attrs);4242+4343+/**4444+ * pt_can_have_leaf() - True if the current level can have an OA entry4545+ * @pts: The current level4646+ *4747+ * True if the current level can support pt_install_leaf_entry(). A leaf4848+ * entry produce an OA.4949+ */5050+static inline bool pt_can_have_leaf(const struct pt_state *pts);5151+5252+/**5353+ * pt_can_have_table() - True if the current level can have a lower table5454+ * @pts: The current level5555+ *5656+ * Every level except 0 is allowed to have a lower table.5757+ */5858+static inline bool pt_can_have_table(const struct pt_state *pts)5959+{6060+ /* No further tables at level 0 */6161+ return pts->level > 0;6262+}6363+6464+/**6565+ * pt_clear_entries() - Make entries empty (non-present)6666+ * @pts: Starting table index6767+ * @num_contig_lg2: Number of contiguous items to clear6868+ *6969+ * Clear a run of entries. A cleared entry will load back as PT_ENTRY_EMPTY7070+ * and does not have any effect on table walking. The starting index must be7171+ * aligned to num_contig_lg2.7272+ */7373+static inline void pt_clear_entries(struct pt_state *pts,7474+ unsigned int num_contig_lg2);7575+7676+/**7777+ * pt_entry_make_write_dirty() - Make an entry dirty7878+ * @pts: Table entry to change7979+ *8080+ * Make pt_entry_is_write_dirty() return true for this entry. This can be called8181+ * asynchronously with any other table manipulation under a RCU lock and must8282+ * not corrupt the table.8383+ */8484+static inline bool pt_entry_make_write_dirty(struct pt_state *pts);8585+8686+/**8787+ * pt_entry_make_write_clean() - Make the entry write clean8888+ * @pts: Table entry to change8989+ *9090+ * Modify the entry so that pt_entry_is_write_dirty() == false. The HW will9191+ * eventually be notified of this change via a TLB flush, which is the point9292+ * that the HW must become synchronized. 
Any "write dirty" prior to the TLB9393+ * flush can be lost, but once the TLB flush completes all writes must make9494+ * their entries write dirty.9595+ *9696+ * The format should alter the entry in a way that is compatible with any9797+ * concurrent update from HW. The entire contiguous entry is changed.9898+ */9999+static inline void pt_entry_make_write_clean(struct pt_state *pts);100100+101101+/**102102+ * pt_entry_is_write_dirty() - True if the entry has been written to103103+ * @pts: Entry to query104104+ *105105+ * "write dirty" means that the HW has written to the OA translated106106+ * by this entry. If the entry is contiguous then the consolidated107107+ * "write dirty" for all the items must be returned.108108+ */109109+static inline bool pt_entry_is_write_dirty(const struct pt_state *pts);110110+111111+/**112112+ * pt_dirty_supported() - True if the page table supports dirty tracking113113+ * @common: Page table to query114114+ */115115+static inline bool pt_dirty_supported(struct pt_common *common);116116+117117+/**118118+ * pt_entry_num_contig_lg2() - Number of contiguous items for this leaf entry119119+ * @pts: Entry to query120120+ *121121+ * Return the number of contiguous items this leaf entry spans. If the entry122122+ * is single item it returns ilog2(1).123123+ */124124+static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts);125125+126126+/**127127+ * pt_entry_oa() - Output Address for this leaf entry128128+ * @pts: Entry to query129129+ *130130+ * Return the output address for the start of the entry. If the entry131131+ * is contiguous this returns the same value for each sub-item. I.e.::132132+ *133133+ * log2_mod(pt_entry_oa(), pt_entry_oa_lg2sz()) == 0134134+ *135135+ * See pt_item_oa(). The format should implement one of these two functions136136+ * depending on how it stores the OAs in the table.137137+ */138138+static inline pt_oaddr_t pt_entry_oa(const struct pt_state *pts);139139+140140+/**141141+ * pt_entry_oa_lg2sz() - Return the size of an OA entry142142+ * @pts: Entry to query143143+ *144144+ * If the entry is not contiguous this returns pt_table_item_lg2sz(), otherwise145145+ * it returns the total VA/OA size of the entire contiguous entry.146146+ */147147+static inline unsigned int pt_entry_oa_lg2sz(const struct pt_state *pts)148148+{149149+ return pt_entry_num_contig_lg2(pts) + pt_table_item_lg2sz(pts);150150+}151151+152152+/**153153+ * pt_entry_oa_exact() - Return the complete OA for an entry154154+ * @pts: Entry to query155155+ *156156+ * During iteration the first entry could have a VA with an offset from the157157+ * natural start of the entry. Return the exact OA including the pts's VA158158+ * offset.159159+ */160160+static inline pt_oaddr_t pt_entry_oa_exact(const struct pt_state *pts)161161+{162162+ return _pt_entry_oa_fast(pts) |163163+ log2_mod(pts->range->va, pt_entry_oa_lg2sz(pts));164164+}165165+166166+/**167167+ * pt_full_va_prefix() - The top bits of the VA168168+ * @common: Page table to query169169+ *170170+ * This is usually 0, but some formats have their VA space going downward from171171+ * PT_VADDR_MAX, and will return that instead. 
This value must always be172172+ * adjusted by struct pt_common max_vasz_lg2.173173+ */174174+static inline pt_vaddr_t pt_full_va_prefix(const struct pt_common *common);175175+176176+/**177177+ * pt_has_system_page_size() - True if level 0 can install a PAGE_SHIFT entry178178+ * @common: Page table to query179179+ *180180+ * If true the caller can use, at level 0, pt_install_leaf_entry(PAGE_SHIFT).181181+ * This is useful to create optimized paths for common cases of PAGE_SIZE182182+ * mappings.183183+ */184184+static inline bool pt_has_system_page_size(const struct pt_common *common);185185+186186+/**187187+ * pt_install_leaf_entry() - Write a leaf entry to the table188188+ * @pts: Table index to change189189+ * @oa: Output Address for this leaf190190+ * @oasz_lg2: Size in VA/OA for this leaf191191+ * @attrs: Attributes to modify the entry192192+ *193193+ * A leaf OA entry will return PT_ENTRY_OA from pt_load_entry(). It translates194194+ * the VA indicated by pts to the given OA.195195+ *196196+ * For a single item non-contiguous entry oasz_lg2 is pt_table_item_lg2sz().197197+ * For contiguous it is pt_table_item_lg2sz() + num_contig_lg2.198198+ *199199+ * This must not be called if pt_can_have_leaf() == false. Contiguous sizes200200+ * not indicated by pt_possible_sizes() must not be specified.201201+ */202202+static inline void pt_install_leaf_entry(struct pt_state *pts, pt_oaddr_t oa,203203+ unsigned int oasz_lg2,204204+ const struct pt_write_attrs *attrs);205205+206206+/**207207+ * pt_install_table() - Write a table entry to the table208208+ * @pts: Table index to change209209+ * @table_pa: CPU physical address of the lower table's memory210210+ * @attrs: Attributes to modify the table index211211+ *212212+ * A table entry will return PT_ENTRY_TABLE from pt_load_entry(). The table_pa213213+ * is the table at pts->level - 1. This is done by cmpxchg so pts must have the214214+ * current entry loaded. The pts is updated with the installed entry.215215+ *216216+ * This must not be called if pt_can_have_table() == false.217217+ *218218+ * Returns: true if the table was installed successfully.219219+ */220220+static inline bool pt_install_table(struct pt_state *pts, pt_oaddr_t table_pa,221221+ const struct pt_write_attrs *attrs);222222+223223+/**224224+ * pt_item_oa() - Output Address for this leaf item225225+ * @pts: Item to query226226+ *227227+ * Return the output address for this item. If the item is part of a contiguous228228+ * entry it returns the value of the OA for this individual sub item.229229+ *230230+ * See pt_entry_oa(). The format should implement one of these two functions231231+ * depending on how it stores the OA's in the table.232232+ */233233+static inline pt_oaddr_t pt_item_oa(const struct pt_state *pts);234234+235235+/**236236+ * pt_load_entry_raw() - Read from the location pts points at into the pts237237+ * @pts: Table index to load238238+ *239239+ * Return the type of entry that was loaded. pts->entry will be filled in with240240+ * the entry's content. See pt_load_entry()241241+ */242242+static inline enum pt_entry_type pt_load_entry_raw(struct pt_state *pts);243243+244244+/**245245+ * pt_max_oa_lg2() - Return the maximum OA the table format can hold246246+ * @common: Page table to query247247+ *248248+ * The value oalog2_to_max_int(pt_max_oa_lg2()) is the MAX for the249249+ * OA. This is the absolute maximum address the table can hold. 
struct pt_common250250+ * max_oasz_lg2 sets a lower dynamic maximum based on HW capability.251251+ */252252+static inline unsigned int253253+pt_max_oa_lg2(const struct pt_common *common);254254+255255+/**256256+ * pt_num_items_lg2() - Return the number of items in this table level257257+ * @pts: The current level258258+ *259259+ * The number of items in a table level defines the number of bits this level260260+ * decodes from the VA. This function is not called for the top level,261261+ * so it does not need to compute a special value for the top case. The262262+ * result for the top is based on pt_common max_vasz_lg2.263263+ *264264+ * The value is used as part of determining the table indexes via the265265+ * equation::266266+ *267267+ * log2_mod(log2_div(VA, pt_table_item_lg2sz()), pt_num_items_lg2())268268+ */269269+static inline unsigned int pt_num_items_lg2(const struct pt_state *pts);270270+271271+/**272272+ * pt_pgsz_lg2_to_level - Return the level that maps the page size273273+ * @common: Page table to query274274+ * @pgsize_lg2: Log2 page size275275+ *276276+ * Returns the table level that will map the given page size. The page277277+ * size must be part of the pt_possible_sizes() for some level.278278+ */279279+static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,280280+ unsigned int pgsize_lg2);281281+282282+/**283283+ * pt_possible_sizes() - Return a bitmap of possible output sizes at this level284284+ * @pts: The current level285285+ *286286+ * Each level has a list of possible output sizes that can be installed as287287+ * leaf entries. If pt_can_have_leaf() is false returns zero.288288+ *289289+ * Otherwise the bit in position pt_table_item_lg2sz() should be set indicating290290+ * that a non-contiguous single item leaf entry is supported. The following291291+ * pt_num_items_lg2() number of bits can be set indicating contiguous entries292292+ * are supported. Bit pt_table_item_lg2sz() + pt_num_items_lg2() must not be293293+ * set, contiguous entries cannot span the entire table.294294+ *295295+ * The OR of pt_possible_sizes() of all levels is the typical bitmask of all296296+ * supported sizes in the entire table.297297+ */298298+static inline pt_vaddr_t pt_possible_sizes(const struct pt_state *pts);299299+300300+/**301301+ * pt_table_item_lg2sz() - Size of a single item entry in this table level302302+ * @pts: The current level303303+ *304304+ * The size of the item specifies how much VA and OA a single item occupies.305305+ *306306+ * See pt_entry_oa_lg2sz() for the same value including the effect of contiguous307307+ * entries.308308+ */309309+static inline unsigned int pt_table_item_lg2sz(const struct pt_state *pts);310310+311311+/**312312+ * pt_table_oa_lg2sz() - Return the VA/OA size of the entire table313313+ * @pts: The current level314314+ *315315+ * Return the size of VA decoded by the entire table level.316316+ */317317+static inline unsigned int pt_table_oa_lg2sz(const struct pt_state *pts)318318+{319319+ if (pts->range->top_level == pts->level)320320+ return pts->range->max_vasz_lg2;321321+ return min_t(unsigned int, pts->range->common->max_vasz_lg2,322322+ pt_num_items_lg2(pts) + pt_table_item_lg2sz(pts));323323+}324324+325325+/**326326+ * pt_table_pa() - Return the CPU physical address of the table entry327327+ * @pts: Entry to query328328+ *329329+ * This is only ever called on PT_ENTRY_TABLE entries. 
Must return the same330330+ * value passed to pt_install_table().331331+ */332332+static inline pt_oaddr_t pt_table_pa(const struct pt_state *pts);333333+334334+/**335335+ * pt_table_ptr() - Return a CPU pointer for a table item336336+ * @pts: Entry to query337337+ *338338+ * Same as pt_table_pa() but returns a CPU pointer.339339+ */340340+static inline struct pt_table_p *pt_table_ptr(const struct pt_state *pts)341341+{342342+ return __va(pt_table_pa(pts));343343+}344344+345345+/**346346+ * pt_max_sw_bit() - Return the maximum software bit usable for any level and347347+ * entry348348+ * @common: Page table349349+ *350350+ * The swbit can be passed as bitnr to the other sw_bit functions.351351+ */352352+static inline unsigned int pt_max_sw_bit(struct pt_common *common);353353+354354+/**355355+ * pt_test_sw_bit_acquire() - Read a software bit in an item356356+ * @pts: Entry to read357357+ * @bitnr: Bit to read358358+ *359359+ * Software bits are ignored by HW and can be used for any purpose by the360360+ * software. This does a test bit and acquire operation.361361+ */362362+static inline bool pt_test_sw_bit_acquire(struct pt_state *pts,363363+ unsigned int bitnr);364364+365365+/**366366+ * pt_set_sw_bit_release() - Set a software bit in an item367367+ * @pts: Entry to set368368+ * @bitnr: Bit to set369369+ *370370+ * Software bits are ignored by HW and can be used for any purpose by the371371+ * software. This does a set bit and release operation.372372+ */373373+static inline void pt_set_sw_bit_release(struct pt_state *pts,374374+ unsigned int bitnr);375375+376376+/**377377+ * pt_load_entry() - Read from the location pts points at into the pts378378+ * @pts: Table index to load379379+ *380380+ * Set the type of entry that was loaded. pts->entry and pts->table_lower381381+ * will be filled in with the entry's content.382382+ */383383+static inline void pt_load_entry(struct pt_state *pts)384384+{385385+ pts->type = pt_load_entry_raw(pts);386386+ if (pts->type == PT_ENTRY_TABLE)387387+ pts->table_lower = pt_table_ptr(pts);388388+}389389+#endif
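The index equation quoted in the pt_num_items_lg2() kdoc can be checked with a standalone sketch. The 4K-item, 512-entries-per-table layout below is a hypothetical x86-64-like format used only for illustration::

    #include <stdio.h>

    /* Hypothetical format: 4K items at level 0, 512 items per table level */
    static unsigned int item_lg2sz(unsigned int level)
    {
            return 12 + 9 * level;
    }

    static unsigned int table_index(unsigned long long va, unsigned int level)
    {
            /* log2_mod(log2_div(VA, pt_table_item_lg2sz()), pt_num_items_lg2()) */
            return (va >> item_lg2sz(level)) & ((1u << 9) - 1);
    }

    int main(void)
    {
            unsigned int level;

            for (level = 0; level != 3; level++)
                    printf("level %u index %u\n", level,
                           table_index(0x40201000ULL, level));
            return 0;
    }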
+332
drivers/iommu/generic_pt/pt_defs.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * This header is included before the format. It contains definitions66+ * that are required to compile the format. The header order is:77+ * pt_defs.h88+ * fmt_XX.h99+ * pt_common.h1010+ */1111+#ifndef __GENERIC_PT_DEFS_H1212+#define __GENERIC_PT_DEFS_H1313+1414+#include <linux/generic_pt/common.h>1515+1616+#include <linux/types.h>1717+#include <linux/atomic.h>1818+#include <linux/bits.h>1919+#include <linux/limits.h>2020+#include <linux/bug.h>2121+#include <linux/kconfig.h>2222+#include "pt_log2.h"2323+2424+/* Header self-compile default defines */2525+#ifndef pt_write_attrs2626+typedef u64 pt_vaddr_t;2727+typedef u64 pt_oaddr_t;2828+#endif2929+3030+struct pt_table_p;3131+3232+enum {3333+ PT_VADDR_MAX = sizeof(pt_vaddr_t) == 8 ? U64_MAX : U32_MAX,3434+ PT_VADDR_MAX_LG2 = sizeof(pt_vaddr_t) == 8 ? 64 : 32,3535+ PT_OADDR_MAX = sizeof(pt_oaddr_t) == 8 ? U64_MAX : U32_MAX,3636+ PT_OADDR_MAX_LG2 = sizeof(pt_oaddr_t) == 8 ? 64 : 32,3737+};3838+3939+/*4040+ * The format instantiation can have features wired off or on to optimize the4141+ * code gen. Supported features are just a reflection of what the current set of4242+ * kernel users want to use.4343+ */4444+#ifndef PT_SUPPORTED_FEATURES4545+#define PT_SUPPORTED_FEATURES 04646+#endif4747+4848+/*4949+ * When in debug mode we compile all formats with all features. This allows the5050+ * kunit to test the full matrix. SIGN_EXTEND can't co-exist with DYNAMIC_TOP or5151+ * FULL_VA. DMA_INCOHERENT requires a SW bit that not all formats have5252+ */5353+#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)5454+enum {5555+ PT_ORIG_SUPPORTED_FEATURES = PT_SUPPORTED_FEATURES,5656+ PT_DEBUG_SUPPORTED_FEATURES =5757+ UINT_MAX &5858+ ~((PT_ORIG_SUPPORTED_FEATURES & BIT(PT_FEAT_DMA_INCOHERENT) ?5959+ 0 :6060+ BIT(PT_FEAT_DMA_INCOHERENT))) &6161+ ~((PT_ORIG_SUPPORTED_FEATURES & BIT(PT_FEAT_SIGN_EXTEND)) ?6262+ BIT(PT_FEAT_DYNAMIC_TOP) | BIT(PT_FEAT_FULL_VA) :6363+ BIT(PT_FEAT_SIGN_EXTEND)),6464+};6565+#undef PT_SUPPORTED_FEATURES6666+#define PT_SUPPORTED_FEATURES PT_DEBUG_SUPPORTED_FEATURES6767+#endif6868+6969+#ifndef PT_FORCE_ENABLED_FEATURES7070+#define PT_FORCE_ENABLED_FEATURES 07171+#endif7272+7373+/**7474+ * DOC: Generic Page Table Language7575+ *7676+ * Language used in Generic Page Table7777+ * VA7878+ * The input address to the page table, often the virtual address.7979+ * OA8080+ * The output address from the page table, often the physical address.8181+ * leaf8282+ * An entry that results in an output address.8383+ * start/end8484+ * An half-open range, e.g. [0,0) refers to no VA.8585+ * start/last8686+ * An inclusive closed range, e.g. [0,0] refers to the VA 08787+ * common8888+ * The generic page table container struct pt_common8989+ * level9090+ * Level 0 is always a table of only leaves with no futher table pointers.9191+ * Increasing levels increase the size of the table items. The least9292+ * significant VA bits used to index page tables are used to index the Level9393+ * 0 table. The various labels for table levels used by HW descriptions are9494+ * not used.9595+ * top_level9696+ * The inclusive highest level of the table. A two-level table9797+ * has a top level of 1.9898+ * table9999+ * A linear array of translation items for that level.100100+ * index101101+ * The position in a table of an element: item = table[index]102102+ * item103103+ * A single index in a table104104+ * entry105105+ * A single logical element in a table. 
If contiguous pages are not106106+ * supported then item and entry are the same thing, otherwise entry refers107107+ * to all the items that comprise a single contiguous translation.108108+ * item/entry_size109109+ * The number of bytes of VA the table index translates for.110110+ * If the item is a table entry then the next table covers111111+ * this size. If the entry translates to an output address then the112112+ * full OA is: OA | (VA % entry_size)113113+ * contig_count114114+ * The number of consecutive items fused into a single entry.115115+ * item_size * contig_count is the size of that entry's translation.116116+ * lg2117117+ * Indicates the value is encoded as log2, i.e. 1<<x is the actual value.118118+ * Normally the compiler is fine to optimize divide and mod with log2 values119119+ * automatically when inlining, however if the values are not constant120120+ * expressions it can't. So we do it by hand; we want to avoid 64-bit121121+ * divmod.122122+ */123123+124124+/* Returned by pt_load_entry() and for_each_pt_level_entry() */125125+enum pt_entry_type {126126+ PT_ENTRY_EMPTY,127127+ /* Entry is valid and points to a lower table level */128128+ PT_ENTRY_TABLE,129129+ /* Entry is valid and returns an output address */130130+ PT_ENTRY_OA,131131+};132132+133133+struct pt_range {134134+ struct pt_common *common;135135+ struct pt_table_p *top_table;136136+ pt_vaddr_t va;137137+ pt_vaddr_t last_va;138138+ u8 top_level;139139+ u8 max_vasz_lg2;140140+};141141+142142+/*143143+ * Similar to xa_state, this records information about an in-progress parse at a144144+ * single level.145145+ */146146+struct pt_state {147147+ struct pt_range *range;148148+ struct pt_table_p *table;149149+ struct pt_table_p *table_lower;150150+ u64 entry;151151+ enum pt_entry_type type;152152+ unsigned short index;153153+ unsigned short end_index;154154+ u8 level;155155+};156156+157157+#define pt_cur_table(pts, type) ((type *)((pts)->table))158158+159159+/*160160+ * Try to install a new table pointer. The locking methodology requires this to161161+ * be atomic (multiple threads can race to install a pointer). The losing162162+ * threads will fail the atomic and return false. They should free any memory163163+ * and reparse the table level again.164164+ */165165+#if !IS_ENABLED(CONFIG_GENERIC_ATOMIC64)166166+static inline bool pt_table_install64(struct pt_state *pts, u64 table_entry)167167+{168168+ u64 *entryp = pt_cur_table(pts, u64) + pts->index;169169+ u64 old_entry = pts->entry;170170+ bool ret;171171+172172+ /*173173+ * Ensure the zero'd table content itself is visible before its PTE can174174+ * be. release is a NOP on !SMP, but the HW is still doing an acquire.175175+ */176176+ if (!IS_ENABLED(CONFIG_SMP))177177+ dma_wmb();178178+ ret = try_cmpxchg64_release(entryp, &old_entry, table_entry);179179+ if (ret)180180+ pts->entry = table_entry;181181+ return ret;182182+}183183+#endif184184+185185+static inline bool pt_table_install32(struct pt_state *pts, u32 table_entry)186186+{187187+ u32 *entryp = pt_cur_table(pts, u32) + pts->index;188188+ u32 old_entry = pts->entry;189189+ bool ret;190190+191191+ /*192192+ * Ensure the zero'd table content itself is visible before its PTE can193193+ * be. 
release is a NOP on !SMP, but the HW is still doing an acquire.194194+ */195195+ if (!IS_ENABLED(CONFIG_SMP))196196+ dma_wmb();197197+ ret = try_cmpxchg_release(entryp, &old_entry, table_entry);198198+ if (ret)199199+ pts->entry = table_entry;200200+ return ret;201201+}202202+203203+#define PT_SUPPORTED_FEATURE(feature_nr) (PT_SUPPORTED_FEATURES & BIT(feature_nr))204204+205205+static inline bool pt_feature(const struct pt_common *common,206206+ unsigned int feature_nr)207207+{208208+ if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))209209+ return true;210210+ if (!PT_SUPPORTED_FEATURE(feature_nr))211211+ return false;212212+ return common->features & BIT(feature_nr);213213+}214214+215215+static inline bool pts_feature(const struct pt_state *pts,216216+ unsigned int feature_nr)217217+{218218+ return pt_feature(pts->range->common, feature_nr);219219+}220220+221221+/*222222+ * PT_WARN_ON is used for invariants that the kunit should be checking can't223223+ * happen.224224+ */225225+#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)226226+#define PT_WARN_ON WARN_ON227227+#else228228+static inline bool PT_WARN_ON(bool condition)229229+{230230+ return false;231231+}232232+#endif233233+234234+/* These all work on the VA type */235235+#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)236236+#define log2_to_max_int(a_lg2) log2_to_max_int_t(pt_vaddr_t, a_lg2)237237+#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)238238+#define log2_div_eq(a, b, c_lg2) log2_div_eq_t(pt_vaddr_t, a, b, c_lg2)239239+#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)240240+#define log2_mod_eq_max(a, b_lg2) log2_mod_eq_max_t(pt_vaddr_t, a, b_lg2)241241+#define log2_set_mod(a, val, b_lg2) log2_set_mod_t(pt_vaddr_t, a, val, b_lg2)242242+#define log2_set_mod_max(a, b_lg2) log2_set_mod_max_t(pt_vaddr_t, a, b_lg2)243243+#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)244244+#define vaffs(a) ffs_t(pt_vaddr_t, a)245245+#define vafls(a) fls_t(pt_vaddr_t, a)246246+#define vaffz(a) ffz_t(pt_vaddr_t, a)247247+248248+/*249249+ * The full VA (fva) versions permit the lg2 value to be == PT_VADDR_MAX_LG2 and250250+ * generate a useful defined result. 
The non-fva versions will malfunction at251251+ * this extreme.252252+ */253253+static inline pt_vaddr_t fvalog2_div(pt_vaddr_t a, unsigned int b_lg2)254254+{255255+ if (PT_SUPPORTED_FEATURE(PT_FEAT_FULL_VA) && b_lg2 == PT_VADDR_MAX_LG2)256256+ return 0;257257+ return log2_div_t(pt_vaddr_t, a, b_lg2);258258+}259259+260260+static inline pt_vaddr_t fvalog2_mod(pt_vaddr_t a, unsigned int b_lg2)261261+{262262+ if (PT_SUPPORTED_FEATURE(PT_FEAT_FULL_VA) && b_lg2 == PT_VADDR_MAX_LG2)263263+ return a;264264+ return log2_mod_t(pt_vaddr_t, a, b_lg2);265265+}266266+267267+static inline bool fvalog2_div_eq(pt_vaddr_t a, pt_vaddr_t b,268268+ unsigned int c_lg2)269269+{270270+ if (PT_SUPPORTED_FEATURE(PT_FEAT_FULL_VA) && c_lg2 == PT_VADDR_MAX_LG2)271271+ return true;272272+ return log2_div_eq_t(pt_vaddr_t, a, b, c_lg2);273273+}274274+275275+static inline pt_vaddr_t fvalog2_set_mod(pt_vaddr_t a, pt_vaddr_t val,276276+ unsigned int b_lg2)277277+{278278+ if (PT_SUPPORTED_FEATURE(PT_FEAT_FULL_VA) && b_lg2 == PT_VADDR_MAX_LG2)279279+ return val;280280+ return log2_set_mod_t(pt_vaddr_t, a, val, b_lg2);281281+}282282+283283+static inline pt_vaddr_t fvalog2_set_mod_max(pt_vaddr_t a, unsigned int b_lg2)284284+{285285+ if (PT_SUPPORTED_FEATURE(PT_FEAT_FULL_VA) && b_lg2 == PT_VADDR_MAX_LG2)286286+ return PT_VADDR_MAX;287287+ return log2_set_mod_max_t(pt_vaddr_t, a, b_lg2);288288+}289289+290290+/* These all work on the OA type */291291+#define oalog2_to_int(a_lg2) log2_to_int_t(pt_oaddr_t, a_lg2)292292+#define oalog2_to_max_int(a_lg2) log2_to_max_int_t(pt_oaddr_t, a_lg2)293293+#define oalog2_div(a, b_lg2) log2_div_t(pt_oaddr_t, a, b_lg2)294294+#define oalog2_div_eq(a, b, c_lg2) log2_div_eq_t(pt_oaddr_t, a, b, c_lg2)295295+#define oalog2_mod(a, b_lg2) log2_mod_t(pt_oaddr_t, a, b_lg2)296296+#define oalog2_mod_eq_max(a, b_lg2) log2_mod_eq_max_t(pt_oaddr_t, a, b_lg2)297297+#define oalog2_set_mod(a, val, b_lg2) log2_set_mod_t(pt_oaddr_t, a, val, b_lg2)298298+#define oalog2_set_mod_max(a, b_lg2) log2_set_mod_max_t(pt_oaddr_t, a, b_lg2)299299+#define oalog2_mul(a, b_lg2) log2_mul_t(pt_oaddr_t, a, b_lg2)300300+#define oaffs(a) ffs_t(pt_oaddr_t, a)301301+#define oafls(a) fls_t(pt_oaddr_t, a)302302+#define oaffz(a) ffz_t(pt_oaddr_t, a)303303+304304+static inline uintptr_t _pt_top_set(struct pt_table_p *table_mem,305305+ unsigned int top_level)306306+{307307+ return top_level | (uintptr_t)table_mem;308308+}309309+310310+static inline void pt_top_set(struct pt_common *common,311311+ struct pt_table_p *table_mem,312312+ unsigned int top_level)313313+{314314+ WRITE_ONCE(common->top_of_table, _pt_top_set(table_mem, top_level));315315+}316316+317317+static inline void pt_top_set_level(struct pt_common *common,318318+ unsigned int top_level)319319+{320320+ pt_top_set(common, NULL, top_level);321321+}322322+323323+static inline unsigned int pt_top_get_level(const struct pt_common *common)324324+{325325+ return READ_ONCE(common->top_of_table) % (1 << PT_TOP_LEVEL_BITS);326326+}327327+328328+static inline bool pt_check_install_leaf_args(struct pt_state *pts,329329+ pt_oaddr_t oa,330330+ unsigned int oasz_lg2);331331+332332+#endif
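A standalone illustration of the top_of_table encoding used by pt_top_set()/pt_top_get_level() above: the top level number is carried in the low bits of the (aligned) top table pointer. The TOP_LEVEL_BITS value of 3 and the pointer-mask recovery are assumptions made for the sketch; only the pack and level-extract arithmetic mirror the helpers shown::

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TOP_LEVEL_BITS 3 /* assumed stand-in for PT_TOP_LEVEL_BITS */

    int main(void)
    {
            /* A page-aligned allocation stands in for the top table memory */
            void *table_mem = aligned_alloc(4096, 4096);
            uintptr_t top_of_table;

            if (!table_mem)
                    return 1;
            top_of_table = 2 | (uintptr_t)table_mem; /* top_level == 2 */

            printf("level %u table %p\n",
                   (unsigned int)(top_of_table % (1u << TOP_LEVEL_BITS)),
                   (void *)(top_of_table & ~(((uintptr_t)1 << TOP_LEVEL_BITS) - 1)));
            free(table_mem);
            return 0;
    }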
+295
drivers/iommu/generic_pt/pt_fmt_defaults.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * Default definitions for formats that don't define these functions.66+ */77+#ifndef __GENERIC_PT_PT_FMT_DEFAULTS_H88+#define __GENERIC_PT_PT_FMT_DEFAULTS_H99+1010+#include "pt_defs.h"1111+#include <linux/log2.h>1212+1313+/* Header self-compile default defines */1414+#ifndef pt_load_entry_raw1515+#include "fmt/amdv1.h"1616+#endif1717+1818+/*1919+ * The format must provide PT_GRANULE_LG2SZ, PT_TABLEMEM_LG2SZ, and2020+ * PT_ITEM_WORD_SIZE. They must be the same at every level excluding the top.2121+ */2222+#ifndef pt_table_item_lg2sz2323+static inline unsigned int pt_table_item_lg2sz(const struct pt_state *pts)2424+{2525+ return PT_GRANULE_LG2SZ +2626+ (PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE)) * pts->level;2727+}2828+#endif2929+3030+#ifndef pt_pgsz_lg2_to_level3131+static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,3232+ unsigned int pgsize_lg2)3333+{3434+ return ((unsigned int)(pgsize_lg2 - PT_GRANULE_LG2SZ)) /3535+ (PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));3636+}3737+#endif3838+3939+/*4040+ * If not supplied by the format then contiguous pages are not supported.4141+ *4242+ * If contiguous pages are supported then the format must also provide4343+ * pt_contig_count_lg2() if it supports a single contiguous size per level,4444+ * or pt_possible_sizes() if it supports multiple sizes per level.4545+ */4646+#ifndef pt_entry_num_contig_lg24747+static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)4848+{4949+ return ilog2(1);5050+}5151+5252+/*5353+ * Return the number of contiguous OA items forming an entry at this table level5454+ */5555+static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)5656+{5757+ return ilog2(1);5858+}5959+#endif6060+6161+/* If not supplied by the format then dirty tracking is not supported */6262+#ifndef pt_entry_is_write_dirty6363+static inline bool pt_entry_is_write_dirty(const struct pt_state *pts)6464+{6565+ return false;6666+}6767+6868+static inline void pt_entry_make_write_clean(struct pt_state *pts)6969+{7070+}7171+7272+static inline bool pt_dirty_supported(struct pt_common *common)7373+{7474+ return false;7575+}7676+#else7777+/* If not supplied then dirty tracking is always enabled */7878+#ifndef pt_dirty_supported7979+static inline bool pt_dirty_supported(struct pt_common *common)8080+{8181+ return true;8282+}8383+#endif8484+#endif8585+8686+#ifndef pt_entry_make_write_dirty8787+static inline bool pt_entry_make_write_dirty(struct pt_state *pts)8888+{8989+ return false;9090+}9191+#endif9292+9393+/*9494+ * Format supplies either:9595+ * pt_entry_oa - OA is at the start of a contiguous entry9696+ * or9797+ * pt_item_oa - OA is adjusted for every item in a contiguous entry9898+ *9999+ * Build the missing one100100+ *101101+ * The internal helper _pt_entry_oa_fast() allows generating102102+ * an efficient pt_entry_oa_exact(), it doesn't care which103103+ * option is selected.104104+ */105105+#ifdef pt_entry_oa106106+static inline pt_oaddr_t pt_item_oa(const struct pt_state *pts)107107+{108108+ return pt_entry_oa(pts) |109109+ log2_mul(pts->index, pt_table_item_lg2sz(pts));110110+}111111+#define _pt_entry_oa_fast pt_entry_oa112112+#endif113113+114114+#ifdef pt_item_oa115115+static inline pt_oaddr_t pt_entry_oa(const struct pt_state *pts)116116+{117117+ return log2_set_mod(pt_item_oa(pts), 0,118118+ pt_entry_num_contig_lg2(pts) +119119+ 
pt_table_item_lg2sz(pts));120120+}121121+#define _pt_entry_oa_fast pt_item_oa122122+#endif123123+124124+/*125125+ * If not supplied by the format then use the constant126126+ * PT_MAX_OUTPUT_ADDRESS_LG2.127127+ */128128+#ifndef pt_max_oa_lg2129129+static inline unsigned int130130+pt_max_oa_lg2(const struct pt_common *common)131131+{132132+ return PT_MAX_OUTPUT_ADDRESS_LG2;133133+}134134+#endif135135+136136+#ifndef pt_has_system_page_size137137+static inline bool pt_has_system_page_size(const struct pt_common *common)138138+{139139+ return PT_GRANULE_LG2SZ == PAGE_SHIFT;140140+}141141+#endif142142+143143+/*144144+ * If not supplied by the format then assume only one contiguous size determined145145+ * by pt_contig_count_lg2()146146+ */147147+#ifndef pt_possible_sizes148148+static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts);149149+150150+/* Return a bitmap of possible leaf page sizes at this level */151151+static inline pt_vaddr_t pt_possible_sizes(const struct pt_state *pts)152152+{153153+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);154154+155155+ if (!pt_can_have_leaf(pts))156156+ return 0;157157+ return log2_to_int(isz_lg2) |158158+ log2_to_int(pt_contig_count_lg2(pts) + isz_lg2);159159+}160160+#endif161161+162162+/* If not supplied by the format then use 0. */163163+#ifndef pt_full_va_prefix164164+static inline pt_vaddr_t pt_full_va_prefix(const struct pt_common *common)165165+{166166+ return 0;167167+}168168+#endif169169+170170+/* If not supplied by the format then zero fill using PT_ITEM_WORD_SIZE */171171+#ifndef pt_clear_entries172172+static inline void pt_clear_entries64(struct pt_state *pts,173173+ unsigned int num_contig_lg2)174174+{175175+ u64 *tablep = pt_cur_table(pts, u64) + pts->index;176176+ u64 *end = tablep + log2_to_int(num_contig_lg2);177177+178178+ PT_WARN_ON(log2_mod(pts->index, num_contig_lg2));179179+ for (; tablep != end; tablep++)180180+ WRITE_ONCE(*tablep, 0);181181+}182182+183183+static inline void pt_clear_entries32(struct pt_state *pts,184184+ unsigned int num_contig_lg2)185185+{186186+ u32 *tablep = pt_cur_table(pts, u32) + pts->index;187187+ u32 *end = tablep + log2_to_int(num_contig_lg2);188188+189189+ PT_WARN_ON(log2_mod(pts->index, num_contig_lg2));190190+ for (; tablep != end; tablep++)191191+ WRITE_ONCE(*tablep, 0);192192+}193193+194194+static inline void pt_clear_entries(struct pt_state *pts,195195+ unsigned int num_contig_lg2)196196+{197197+ if (PT_ITEM_WORD_SIZE == sizeof(u32))198198+ pt_clear_entries32(pts, num_contig_lg2);199199+ else200200+ pt_clear_entries64(pts, num_contig_lg2);201201+}202202+#define pt_clear_entries pt_clear_entries203203+#endif204204+205205+/* If not supplied then SW bits are not supported */206206+#ifdef pt_sw_bit207207+static inline bool pt_test_sw_bit_acquire(struct pt_state *pts,208208+ unsigned int bitnr)209209+{210210+ /* Acquire, pairs with pt_set_sw_bit_release() */211211+ smp_mb();212212+ /* For a contiguous entry the sw bit is only stored in the first item. 
*/213213+ return pts->entry & pt_sw_bit(bitnr);214214+}215215+#define pt_test_sw_bit_acquire pt_test_sw_bit_acquire216216+217217+static inline void pt_set_sw_bit_release(struct pt_state *pts,218218+ unsigned int bitnr)219219+{220220+#if !IS_ENABLED(CONFIG_GENERIC_ATOMIC64)221221+ if (PT_ITEM_WORD_SIZE == sizeof(u64)) {222222+ u64 *entryp = pt_cur_table(pts, u64) + pts->index;223223+ u64 old_entry = pts->entry;224224+ u64 new_entry;225225+226226+ do {227227+ new_entry = old_entry | pt_sw_bit(bitnr);228228+ } while (!try_cmpxchg64_release(entryp, &old_entry, new_entry));229229+ pts->entry = new_entry;230230+ return;231231+ }232232+#endif233233+ if (PT_ITEM_WORD_SIZE == sizeof(u32)) {234234+ u32 *entryp = pt_cur_table(pts, u32) + pts->index;235235+ u32 old_entry = pts->entry;236236+ u32 new_entry;237237+238238+ do {239239+ new_entry = old_entry | pt_sw_bit(bitnr);240240+ } while (!try_cmpxchg_release(entryp, &old_entry, new_entry));241241+ pts->entry = new_entry;242242+ } else243243+ BUILD_BUG();244244+}245245+#define pt_set_sw_bit_release pt_set_sw_bit_release246246+#else247247+static inline unsigned int pt_max_sw_bit(struct pt_common *common)248248+{249249+ return 0;250250+}251251+252252+extern void __pt_no_sw_bit(void);253253+static inline bool pt_test_sw_bit_acquire(struct pt_state *pts,254254+ unsigned int bitnr)255255+{256256+ __pt_no_sw_bit();257257+ return false;258258+}259259+260260+static inline void pt_set_sw_bit_release(struct pt_state *pts,261261+ unsigned int bitnr)262262+{263263+ __pt_no_sw_bit();264264+}265265+#endif266266+267267+/*268268+ * Format can call in the pt_install_leaf_entry() to check the arguments are all269269+ * aligned correctly.270270+ */271271+static inline bool pt_check_install_leaf_args(struct pt_state *pts,272272+ pt_oaddr_t oa,273273+ unsigned int oasz_lg2)274274+{275275+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);276276+277277+ if (PT_WARN_ON(oalog2_mod(oa, oasz_lg2)))278278+ return false;279279+280280+#ifdef pt_possible_sizes281281+ if (PT_WARN_ON(isz_lg2 > oasz_lg2 ||282282+ oasz_lg2 > isz_lg2 + pt_num_items_lg2(pts)))283283+ return false;284284+#else285285+ if (PT_WARN_ON(oasz_lg2 != isz_lg2 &&286286+ oasz_lg2 != isz_lg2 + pt_contig_count_lg2(pts)))287287+ return false;288288+#endif289289+290290+ if (PT_WARN_ON(oalog2_mod(pts->index, oasz_lg2 - isz_lg2)))291291+ return false;292292+ return true;293293+}294294+295295+#endif
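As an illustration of how these defaults compose with a format header, here is a minimal hypothetical fragment; the examplept_* names, the constants and the bit layout are invented for the example and are not part of this series. It supplies only the three required size constants and pt_entry_oa, and everything else falls back to the defaults above: pt_item_oa is built from pt_entry_oa, and contiguous pages, dirty tracking and SW bits are reported as unsupported.

/* Hypothetical fmt/examplept.h fragment, illustration only */
#define PT_GRANULE_LG2SZ 12		/* 4 KiB leaf items */
#define PT_TABLEMEM_LG2SZ 12		/* 4 KiB of table memory per level */
#define PT_ITEM_WORD_SIZE sizeof(u64)

/* Assumed item layout: the OA is stored in bits 51:12 of the item word */
static inline pt_oaddr_t examplept_entry_oa(const struct pt_state *pts)
{
	return pts->entry & GENMASK_ULL(51, 12);
}
#define pt_entry_oa examplept_entry_oa

Because no other hooks are defined, combining this header with pt_fmt_defaults.h in the compilation unit supplies the remaining functions automatically.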
+636
drivers/iommu/generic_pt/pt_iter.h
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ *55+ * Iterators for Generic Page Table66+ */77+#ifndef __GENERIC_PT_PT_ITER_H88+#define __GENERIC_PT_PT_ITER_H99+1010+#include "pt_common.h"1111+1212+#include <linux/errno.h>1313+1414+/*1515+ * Use to mangle symbols so that backtraces and the symbol table are1616+ * understandable. Any non-inlined function should get mangled like this.1717+ */1818+#define NS(fn) CONCATENATE(PTPFX, fn)1919+2020+/**2121+ * pt_check_range() - Validate the range can be iterated2222+ * @range: Range to validate2323+ *2424+ * Check that VA and last_va fall within the permitted range of VAs. If the2525+ * format is using PT_FEAT_SIGN_EXTEND then this also checks the sign extension2626+ * is correct.2727+ */2828+static inline int pt_check_range(struct pt_range *range)2929+{3030+ pt_vaddr_t prefix;3131+3232+ PT_WARN_ON(!range->max_vasz_lg2);3333+3434+ if (pt_feature(range->common, PT_FEAT_SIGN_EXTEND)) {3535+ PT_WARN_ON(range->common->max_vasz_lg2 != range->max_vasz_lg2);3636+ prefix = fvalog2_div(range->va, range->max_vasz_lg2 - 1) ?3737+ PT_VADDR_MAX :3838+ 0;3939+ } else {4040+ prefix = pt_full_va_prefix(range->common);4141+ }4242+4343+ if (!fvalog2_div_eq(range->va, prefix, range->max_vasz_lg2) ||4444+ !fvalog2_div_eq(range->last_va, prefix, range->max_vasz_lg2))4545+ return -ERANGE;4646+ return 0;4747+}4848+4949+/**5050+ * pt_index_to_va() - Update range->va to the current pts->index5151+ * @pts: Iteration State5252+ *5353+ * Adjust range->va to match the current index. This is done in a lazy manner5454+ * since computing the VA takes several instructions and is rarely required.5555+ */5656+static inline void pt_index_to_va(struct pt_state *pts)5757+{5858+ pt_vaddr_t lower_va;5959+6060+ lower_va = log2_mul(pts->index, pt_table_item_lg2sz(pts));6161+ pts->range->va = fvalog2_set_mod(pts->range->va, lower_va,6262+ pt_table_oa_lg2sz(pts));6363+}6464+6565+/*6666+ * Add index_count_lg2 number of entries to pts's VA and index. 
The VA will be6767+ * adjusted to the end of the contiguous block if it is currently in the middle.6868+ */6969+static inline void _pt_advance(struct pt_state *pts,7070+ unsigned int index_count_lg2)7171+{7272+ pts->index = log2_set_mod(pts->index + log2_to_int(index_count_lg2), 0,7373+ index_count_lg2);7474+}7575+7676+/**7777+ * pt_entry_fully_covered() - Check if the item or entry is entirely contained7878+ * within pts->range7979+ * @pts: Iteration State8080+ * @oasz_lg2: The size of the item to check, pt_table_item_lg2sz() or8181+ * pt_entry_oa_lg2sz()8282+ *8383+ * Returns: true if the item is fully enclosed by the pts->range.8484+ */8585+static inline bool pt_entry_fully_covered(const struct pt_state *pts,8686+ unsigned int oasz_lg2)8787+{8888+ struct pt_range *range = pts->range;8989+9090+ /* Range begins at the start of the entry */9191+ if (log2_mod(pts->range->va, oasz_lg2))9292+ return false;9393+9494+ /* Range ends past the end of the entry */9595+ if (!log2_div_eq(range->va, range->last_va, oasz_lg2))9696+ return true;9797+9898+ /* Range ends at the end of the entry */9999+ return log2_mod_eq_max(range->last_va, oasz_lg2);100100+}101101+102102+/**103103+ * pt_range_to_index() - Starting index for an iteration104104+ * @pts: Iteration State105105+ *106106+ * Return: the starting index for the iteration in pts.107107+ */108108+static inline unsigned int pt_range_to_index(const struct pt_state *pts)109109+{110110+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);111111+112112+ PT_WARN_ON(pts->level > pts->range->top_level);113113+ if (pts->range->top_level == pts->level)114114+ return log2_div(fvalog2_mod(pts->range->va,115115+ pts->range->max_vasz_lg2),116116+ isz_lg2);117117+ return log2_mod(log2_div(pts->range->va, isz_lg2),118118+ pt_num_items_lg2(pts));119119+}120120+121121+/**122122+ * pt_range_to_end_index() - Ending index iteration123123+ * @pts: Iteration State124124+ *125125+ * Return: the last index for the iteration in pts.126126+ */127127+static inline unsigned int pt_range_to_end_index(const struct pt_state *pts)128128+{129129+ unsigned int isz_lg2 = pt_table_item_lg2sz(pts);130130+ struct pt_range *range = pts->range;131131+ unsigned int num_entries_lg2;132132+133133+ if (range->va == range->last_va)134134+ return pts->index + 1;135135+136136+ if (pts->range->top_level == pts->level)137137+ return log2_div(fvalog2_mod(pts->range->last_va,138138+ pts->range->max_vasz_lg2),139139+ isz_lg2) +140140+ 1;141141+142142+ num_entries_lg2 = pt_num_items_lg2(pts);143143+144144+ /* last_va falls within this table */145145+ if (log2_div_eq(range->va, range->last_va, num_entries_lg2 + isz_lg2))146146+ return log2_mod(log2_div(pts->range->last_va, isz_lg2),147147+ num_entries_lg2) +148148+ 1;149149+150150+ return log2_to_int(num_entries_lg2);151151+}152152+153153+static inline void _pt_iter_first(struct pt_state *pts)154154+{155155+ pts->index = pt_range_to_index(pts);156156+ pts->end_index = pt_range_to_end_index(pts);157157+ PT_WARN_ON(pts->index > pts->end_index);158158+}159159+160160+static inline bool _pt_iter_load(struct pt_state *pts)161161+{162162+ if (pts->index >= pts->end_index)163163+ return false;164164+ pt_load_entry(pts);165165+ return true;166166+}167167+168168+/**169169+ * pt_next_entry() - Advance pts to the next entry170170+ * @pts: Iteration State171171+ *172172+ * Update pts to go to the next index at this level. 
If pts is pointing at a173173+ * contiguous entry then the index may advance my more than one.174174+ */175175+static inline void pt_next_entry(struct pt_state *pts)176176+{177177+ if (pts->type == PT_ENTRY_OA &&178178+ !__builtin_constant_p(pt_entry_num_contig_lg2(pts) == 0))179179+ _pt_advance(pts, pt_entry_num_contig_lg2(pts));180180+ else181181+ pts->index++;182182+ pt_index_to_va(pts);183183+}184184+185185+/**186186+ * for_each_pt_level_entry() - For loop wrapper over entries in the range187187+ * @pts: Iteration State188188+ *189189+ * This is the basic iteration primitive. It iterates over all the entries in190190+ * pts->range that fall within the pts's current table level. Each step does191191+ * pt_load_entry(pts).192192+ */193193+#define for_each_pt_level_entry(pts) \194194+ for (_pt_iter_first(pts); _pt_iter_load(pts); pt_next_entry(pts))195195+196196+/**197197+ * pt_load_single_entry() - Version of pt_load_entry() usable within a walker198198+ * @pts: Iteration State199199+ *200200+ * Alternative to for_each_pt_level_entry() if the walker function uses only a201201+ * single entry.202202+ */203203+static inline enum pt_entry_type pt_load_single_entry(struct pt_state *pts)204204+{205205+ pts->index = pt_range_to_index(pts);206206+ pt_load_entry(pts);207207+ return pts->type;208208+}209209+210210+static __always_inline struct pt_range _pt_top_range(struct pt_common *common,211211+ uintptr_t top_of_table)212212+{213213+ struct pt_range range = {214214+ .common = common,215215+ .top_table =216216+ (struct pt_table_p *)(top_of_table &217217+ ~(uintptr_t)PT_TOP_LEVEL_MASK),218218+ .top_level = top_of_table % (1 << PT_TOP_LEVEL_BITS),219219+ };220220+ struct pt_state pts = { .range = &range, .level = range.top_level };221221+ unsigned int max_vasz_lg2;222222+223223+ max_vasz_lg2 = common->max_vasz_lg2;224224+ if (pt_feature(common, PT_FEAT_DYNAMIC_TOP) &&225225+ pts.level != PT_MAX_TOP_LEVEL)226226+ max_vasz_lg2 = min_t(unsigned int, common->max_vasz_lg2,227227+ pt_num_items_lg2(&pts) +228228+ pt_table_item_lg2sz(&pts));229229+230230+ /*231231+ * The top range will default to the lower region only with sign extend.232232+ */233233+ range.max_vasz_lg2 = max_vasz_lg2;234234+ if (pt_feature(common, PT_FEAT_SIGN_EXTEND))235235+ max_vasz_lg2--;236236+237237+ range.va = fvalog2_set_mod(pt_full_va_prefix(common), 0, max_vasz_lg2);238238+ range.last_va =239239+ fvalog2_set_mod_max(pt_full_va_prefix(common), max_vasz_lg2);240240+ return range;241241+}242242+243243+/**244244+ * pt_top_range() - Return a range that spans part of the top level245245+ * @common: Table246246+ *247247+ * For PT_FEAT_SIGN_EXTEND this will return the lower range, and cover half the248248+ * total page table. Otherwise it returns the entire page table.249249+ */250250+static __always_inline struct pt_range pt_top_range(struct pt_common *common)251251+{252252+ /*253253+ * The top pointer can change without locking. We capture the value and254254+ * it's level here and are safe to walk it so long as both values are255255+ * captured without tearing.256256+ */257257+ return _pt_top_range(common, READ_ONCE(common->top_of_table));258258+}259259+260260+/**261261+ * pt_all_range() - Return a range that spans the entire page table262262+ * @common: Table263263+ *264264+ * The returned range spans the whole page table. 
Due to how PT_FEAT_SIGN_EXTEND265265+ * is supported range->va and range->last_va will be incorrect during the266266+ * iteration and must not be accessed.267267+ */268268+static inline struct pt_range pt_all_range(struct pt_common *common)269269+{270270+ struct pt_range range = pt_top_range(common);271271+272272+ if (!pt_feature(common, PT_FEAT_SIGN_EXTEND))273273+ return range;274274+275275+ /*276276+ * Pretend the table is linear from 0 without a sign extension. This277277+ * generates the correct indexes for iteration.278278+ */279279+ range.last_va = fvalog2_set_mod_max(0, range.max_vasz_lg2);280280+ return range;281281+}282282+283283+/**284284+ * pt_upper_range() - Return a range that spans part of the top level285285+ * @common: Table286286+ *287287+ * For PT_FEAT_SIGN_EXTEND this will return the upper range, and cover half the288288+ * total page table. Otherwise it returns the entire page table.289289+ */290290+static inline struct pt_range pt_upper_range(struct pt_common *common)291291+{292292+ struct pt_range range = pt_top_range(common);293293+294294+ if (!pt_feature(common, PT_FEAT_SIGN_EXTEND))295295+ return range;296296+297297+ range.va = fvalog2_set_mod(PT_VADDR_MAX, 0, range.max_vasz_lg2 - 1);298298+ range.last_va = PT_VADDR_MAX;299299+ return range;300300+}301301+302302+/**303303+ * pt_make_range() - Return a range that spans part of the table304304+ * @common: Table305305+ * @va: Start address306306+ * @last_va: Last address307307+ *308308+ * The caller must validate the range with pt_check_range() before using it.309309+ */310310+static __always_inline struct pt_range311311+pt_make_range(struct pt_common *common, pt_vaddr_t va, pt_vaddr_t last_va)312312+{313313+ struct pt_range range =314314+ _pt_top_range(common, READ_ONCE(common->top_of_table));315315+316316+ range.va = va;317317+ range.last_va = last_va;318318+319319+ return range;320320+}321321+322322+/*323323+ * Span a slice of the table starting at a lower table level from an active324324+ * walk.325325+ */326326+static __always_inline struct pt_range327327+pt_make_child_range(const struct pt_range *parent, pt_vaddr_t va,328328+ pt_vaddr_t last_va)329329+{330330+ struct pt_range range = *parent;331331+332332+ range.va = va;333333+ range.last_va = last_va;334334+335335+ PT_WARN_ON(last_va < va);336336+ PT_WARN_ON(pt_check_range(&range));337337+338338+ return range;339339+}340340+341341+/**342342+ * pt_init() - Initialize a pt_state on the stack343343+ * @range: Range pointer to embed in the state344344+ * @level: Table level for the state345345+ * @table: Pointer to the table memory at level346346+ *347347+ * Helper to initialize the on-stack pt_state from walker arguments.348348+ */349349+static __always_inline struct pt_state350350+pt_init(struct pt_range *range, unsigned int level, struct pt_table_p *table)351351+{352352+ struct pt_state pts = {353353+ .range = range,354354+ .table = table,355355+ .level = level,356356+ };357357+ return pts;358358+}359359+360360+/**361361+ * pt_init_top() - Initialize a pt_state on the stack362362+ * @range: Range pointer to embed in the state363363+ *364364+ * The pt_state points to the top most level.365365+ */366366+static __always_inline struct pt_state pt_init_top(struct pt_range *range)367367+{368368+ return pt_init(range, range->top_level, range->top_table);369369+}370370+371371+typedef int (*pt_level_fn_t)(struct pt_range *range, void *arg,372372+ unsigned int level, struct pt_table_p *table);373373+374374+/**375375+ * pt_descend() - Recursively invoke the walker for the 
lower level376376+ * @pts: Iteration State377377+ * @arg: Value to pass to the function378378+ * @fn: Walker function to call379379+ *380380+ * pts must point to a table item. Invoke fn as a walker on the table381381+ * pts points to.382382+ */383383+static __always_inline int pt_descend(struct pt_state *pts, void *arg,384384+ pt_level_fn_t fn)385385+{386386+ int ret;387387+388388+ if (PT_WARN_ON(!pts->table_lower))389389+ return -EINVAL;390390+391391+ ret = (*fn)(pts->range, arg, pts->level - 1, pts->table_lower);392392+ return ret;393393+}394394+395395+/**396396+ * pt_walk_range() - Walk over a VA range397397+ * @range: Range pointer398398+ * @fn: Walker function to call399399+ * @arg: Value to pass to the function400400+ *401401+ * Walk over a VA range. The caller should have done a validity check, at402402+ * least calling pt_check_range(), when building range. The walk will403403+ * start at the top most table.404404+ */405405+static __always_inline int pt_walk_range(struct pt_range *range,406406+ pt_level_fn_t fn, void *arg)407407+{408408+ return fn(range, arg, range->top_level, range->top_table);409409+}410410+411411+/*412412+ * pt_walk_descend() - Recursively invoke the walker for a slice of a lower413413+ * level414414+ * @pts: Iteration State415415+ * @va: Start address416416+ * @last_va: Last address417417+ * @fn: Walker function to call418418+ * @arg: Value to pass to the function419419+ *420420+ * With pts pointing at a table item this will descend and over a slice of the421421+ * lower table. The caller must ensure that va/last_va are within the table422422+ * item. This creates a new walk and does not alter pts or pts->range.423423+ */424424+static __always_inline int pt_walk_descend(const struct pt_state *pts,425425+ pt_vaddr_t va, pt_vaddr_t last_va,426426+ pt_level_fn_t fn, void *arg)427427+{428428+ struct pt_range range = pt_make_child_range(pts->range, va, last_va);429429+430430+ if (PT_WARN_ON(!pt_can_have_table(pts)) ||431431+ PT_WARN_ON(!pts->table_lower))432432+ return -EINVAL;433433+434434+ return fn(&range, arg, pts->level - 1, pts->table_lower);435435+}436436+437437+/*438438+ * pt_walk_descend_all() - Recursively invoke the walker for a table item439439+ * @parent_pts: Iteration State440440+ * @fn: Walker function to call441441+ * @arg: Value to pass to the function442442+ *443443+ * With pts pointing at a table item this will descend and over the entire lower444444+ * table. 
This creates a new walk and does not alter pts or pts->range.445445+ */446446+static __always_inline int447447+pt_walk_descend_all(const struct pt_state *parent_pts, pt_level_fn_t fn,448448+ void *arg)449449+{450450+ unsigned int isz_lg2 = pt_table_item_lg2sz(parent_pts);451451+452452+ return pt_walk_descend(parent_pts,453453+ log2_set_mod(parent_pts->range->va, 0, isz_lg2),454454+ log2_set_mod_max(parent_pts->range->va, isz_lg2),455455+ fn, arg);456456+}457457+458458+/**459459+ * pt_range_slice() - Return a range that spans indexes460460+ * @pts: Iteration State461461+ * @start_index: Starting index within pts462462+ * @end_index: Ending index within pts463463+ *464464+ * Create a range than spans an index range of the current table level465465+ * pt_state points at.466466+ */467467+static inline struct pt_range pt_range_slice(const struct pt_state *pts,468468+ unsigned int start_index,469469+ unsigned int end_index)470470+{471471+ unsigned int table_lg2sz = pt_table_oa_lg2sz(pts);472472+ pt_vaddr_t last_va;473473+ pt_vaddr_t va;474474+475475+ va = fvalog2_set_mod(pts->range->va,476476+ log2_mul(start_index, pt_table_item_lg2sz(pts)),477477+ table_lg2sz);478478+ last_va = fvalog2_set_mod(479479+ pts->range->va,480480+ log2_mul(end_index, pt_table_item_lg2sz(pts)) - 1, table_lg2sz);481481+ return pt_make_child_range(pts->range, va, last_va);482482+}483483+484484+/**485485+ * pt_top_memsize_lg2()486486+ * @common: Table487487+ * @top_of_table: Top of table value from _pt_top_set()488488+ *489489+ * Compute the allocation size of the top table. For PT_FEAT_DYNAMIC_TOP this490490+ * will compute the top size assuming the table will grow.491491+ */492492+static inline unsigned int pt_top_memsize_lg2(struct pt_common *common,493493+ uintptr_t top_of_table)494494+{495495+ struct pt_range range = _pt_top_range(common, top_of_table);496496+ struct pt_state pts = pt_init_top(&range);497497+ unsigned int num_items_lg2;498498+499499+ num_items_lg2 = common->max_vasz_lg2 - pt_table_item_lg2sz(&pts);500500+ if (range.top_level != PT_MAX_TOP_LEVEL &&501501+ pt_feature(common, PT_FEAT_DYNAMIC_TOP))502502+ num_items_lg2 = min(num_items_lg2, pt_num_items_lg2(&pts));503503+504504+ /* Round up the allocation size to the minimum alignment */505505+ return max(ffs_t(u64, PT_TOP_PHYS_MASK),506506+ num_items_lg2 + ilog2(PT_ITEM_WORD_SIZE));507507+}508508+509509+/**510510+ * pt_compute_best_pgsize() - Determine the best page size for leaf entries511511+ * @pgsz_bitmap: Permitted page sizes512512+ * @va: Starting virtual address for the leaf entry513513+ * @last_va: Last virtual address for the leaf entry, sets the max page size514514+ * @oa: Starting output address for the leaf entry515515+ *516516+ * Compute the largest page size for va, last_va, and oa together and return it517517+ * in lg2. The largest page size depends on the format's supported page sizes at518518+ * this level, and the relative alignment of the VA and OA addresses. 
0 means519519+ * the OA cannot be stored with the provided pgsz_bitmap.520520+ */521521+static inline unsigned int pt_compute_best_pgsize(pt_vaddr_t pgsz_bitmap,522522+ pt_vaddr_t va,523523+ pt_vaddr_t last_va,524524+ pt_oaddr_t oa)525525+{526526+ unsigned int best_pgsz_lg2;527527+ unsigned int pgsz_lg2;528528+ pt_vaddr_t len = last_va - va + 1;529529+ pt_vaddr_t mask;530530+531531+ if (PT_WARN_ON(va >= last_va))532532+ return 0;533533+534534+ /*535535+ * Given a VA/OA pair the best page size is the largest page size536536+ * where:537537+ *538538+ * 1) VA and OA start at the page. Bitwise this is the count of least539539+ * significant 0 bits.540540+ * This also implies that last_va/oa has the same prefix as va/oa.541541+ */542542+ mask = va | oa;543543+544544+ /*545545+ * 2) The page size is not larger than the last_va (length). Since page546546+ * sizes are always power of two this can't be larger than the547547+ * largest power of two factor of the length.548548+ */549549+ mask |= log2_to_int(vafls(len) - 1);550550+551551+ best_pgsz_lg2 = vaffs(mask);552552+553553+ /* Choose the highest bit <= best_pgsz_lg2 */554554+ if (best_pgsz_lg2 < PT_VADDR_MAX_LG2 - 1)555555+ pgsz_bitmap = log2_mod(pgsz_bitmap, best_pgsz_lg2 + 1);556556+557557+ pgsz_lg2 = vafls(pgsz_bitmap);558558+ if (!pgsz_lg2)559559+ return 0;560560+561561+ pgsz_lg2--;562562+563563+ PT_WARN_ON(log2_mod(va, pgsz_lg2) != 0);564564+ PT_WARN_ON(oalog2_mod(oa, pgsz_lg2) != 0);565565+ PT_WARN_ON(va + log2_to_int(pgsz_lg2) - 1 > last_va);566566+ PT_WARN_ON(!log2_div_eq(va, va + log2_to_int(pgsz_lg2) - 1, pgsz_lg2));567567+ PT_WARN_ON(568568+ !oalog2_div_eq(oa, oa + log2_to_int(pgsz_lg2) - 1, pgsz_lg2));569569+ return pgsz_lg2;570570+}571571+572572+#define _PT_MAKE_CALL_LEVEL(fn) \573573+ static __always_inline int fn(struct pt_range *range, void *arg, \574574+ unsigned int level, \575575+ struct pt_table_p *table) \576576+ { \577577+ static_assert(PT_MAX_TOP_LEVEL <= 5); \578578+ if (level == 0) \579579+ return CONCATENATE(fn, 0)(range, arg, 0, table); \580580+ if (level == 1 || PT_MAX_TOP_LEVEL == 1) \581581+ return CONCATENATE(fn, 1)(range, arg, 1, table); \582582+ if (level == 2 || PT_MAX_TOP_LEVEL == 2) \583583+ return CONCATENATE(fn, 2)(range, arg, 2, table); \584584+ if (level == 3 || PT_MAX_TOP_LEVEL == 3) \585585+ return CONCATENATE(fn, 3)(range, arg, 3, table); \586586+ if (level == 4 || PT_MAX_TOP_LEVEL == 4) \587587+ return CONCATENATE(fn, 4)(range, arg, 4, table); \588588+ return CONCATENATE(fn, 5)(range, arg, 5, table); \589589+ }590590+591591+static inline int __pt_make_level_fn_err(struct pt_range *range, void *arg,592592+ unsigned int unused_level,593593+ struct pt_table_p *table)594594+{595595+ static_assert(PT_MAX_TOP_LEVEL <= 5);596596+ return -EPROTOTYPE;597597+}598598+599599+#define __PT_MAKE_LEVEL_FN(fn, level, descend_fn, do_fn) \600600+ static inline int fn(struct pt_range *range, void *arg, \601601+ unsigned int unused_level, \602602+ struct pt_table_p *table) \603603+ { \604604+ return do_fn(range, arg, level, table, descend_fn); \605605+ }606606+607607+/**608608+ * PT_MAKE_LEVELS() - Build an unwound walker609609+ * @fn: Name of the walker function610610+ * @do_fn: Function to call at each level611611+ *612612+ * This builds a function call tree that can be fully inlined.613613+ * The caller must provide a function body in an __always_inline function::614614+ *615615+ * static __always_inline int do_fn(struct pt_range *range, void *arg,616616+ * unsigned int level, struct pt_table_p *table,617617+ * 
pt_level_fn_t descend_fn)618618+ *619619+ * An inline function will be created for each table level that calls do_fn with620620+ * a compile time constant for level and a pointer to the next lower function.621621+ * This generates an optimally inlined walk where each of the functions sees a622622+ * constant level and can codegen the exact constants/etc for that level.623623+ *624624+ * Note this can produce a lot of code!625625+ */626626+#define PT_MAKE_LEVELS(fn, do_fn) \627627+ __PT_MAKE_LEVEL_FN(CONCATENATE(fn, 0), 0, __pt_make_level_fn_err, \628628+ do_fn); \629629+ __PT_MAKE_LEVEL_FN(CONCATENATE(fn, 1), 1, CONCATENATE(fn, 0), do_fn); \630630+ __PT_MAKE_LEVEL_FN(CONCATENATE(fn, 2), 2, CONCATENATE(fn, 1), do_fn); \631631+ __PT_MAKE_LEVEL_FN(CONCATENATE(fn, 3), 3, CONCATENATE(fn, 2), do_fn); \632632+ __PT_MAKE_LEVEL_FN(CONCATENATE(fn, 4), 4, CONCATENATE(fn, 3), do_fn); \633633+ __PT_MAKE_LEVEL_FN(CONCATENATE(fn, 5), 5, CONCATENATE(fn, 4), do_fn); \634634+ _PT_MAKE_CALL_LEVEL(fn)635635+636636+#endif
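To show how the iteration helpers fit together, below is a minimal sketch of a recursive walker that counts leaf entries in a range. It is illustration only: the example_* names are invented, and PT_ENTRY_TABLE is assumed to be the entry type pt_load_entry() reports for table items (only PT_ENTRY_OA is referenced in this header).

static int example_count_leaves(struct pt_range *range, void *arg,
				unsigned int level, struct pt_table_p *table)
{
	struct pt_state pts = pt_init(range, level, table);
	unsigned int *count = arg;
	int ret;

	for_each_pt_level_entry(&pts) {
		if (pts.type == PT_ENTRY_OA) {
			(*count)++;
			continue;
		}
		if (pts.type == PT_ENTRY_TABLE) {
			/* Run the same walker over the lower table */
			ret = pt_descend(&pts, arg, example_count_leaves);
			if (ret)
				return ret;
		}
	}
	return 0;
}

static int example_count(struct pt_common *common, pt_vaddr_t va,
			 pt_vaddr_t last_va, unsigned int *count)
{
	struct pt_range range = pt_make_range(common, va, last_va);
	int ret = pt_check_range(&range);

	if (ret)
		return ret;
	return pt_walk_range(&range, example_count_leaves, count);
}

A production walker would normally be generated with PT_MAKE_LEVELS() instead, so that each level sees a compile-time constant as described above.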
+122
drivers/iommu/generic_pt/pt_log2.h
···
/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
 *
 * Helper macros for working with log2 values
 *
 */
#ifndef __GENERIC_PT_LOG2_H
#define __GENERIC_PT_LOG2_H
#include <linux/bitops.h>
#include <linux/limits.h>

/* Compute a */
#define log2_to_int_t(type, a_lg2) ((type)(((type)1) << (a_lg2)))
static_assert(log2_to_int_t(unsigned int, 0) == 1);

/* Compute a - 1 (aka all low bits set) */
#define log2_to_max_int_t(type, a_lg2) ((type)(log2_to_int_t(type, a_lg2) - 1))

/* Compute a / b */
#define log2_div_t(type, a, b_lg2) ((type)(((type)a) >> (b_lg2)))
static_assert(log2_div_t(unsigned int, 4, 2) == 1);

/*
 * Compute:
 *  a / c == b / c
 * aka the high bits are equal
 */
#define log2_div_eq_t(type, a, b, c_lg2) \
	(log2_div_t(type, (a) ^ (b), c_lg2) == 0)
static_assert(log2_div_eq_t(unsigned int, 1, 1, 2));

/* Compute a % b */
#define log2_mod_t(type, a, b_lg2) \
	((type)(((type)a) & log2_to_max_int_t(type, b_lg2)))
static_assert(log2_mod_t(unsigned int, 1, 2) == 1);

/*
 * Compute:
 *  a % b == b - 1
 * aka the low bits are all 1s
 */
#define log2_mod_eq_max_t(type, a, b_lg2) \
	(log2_mod_t(type, a, b_lg2) == log2_to_max_int_t(type, b_lg2))
static_assert(log2_mod_eq_max_t(unsigned int, 3, 2));

/*
 * Return a value such that:
 *  a / b == ret / b
 *  ret % b == val
 * aka set the low bits to val. val must be < b
 */
#define log2_set_mod_t(type, a, val, b_lg2) \
	((((type)(a)) & (~log2_to_max_int_t(type, b_lg2))) | ((type)(val)))
static_assert(log2_set_mod_t(unsigned int, 3, 1, 2) == 1);

/* Return a value such that:
 *  a / b == ret / b
 *  ret % b == b - 1
 * aka set the low bits to all 1s
 */
#define log2_set_mod_max_t(type, a, b_lg2) \
	(((type)(a)) | log2_to_max_int_t(type, b_lg2))
static_assert(log2_set_mod_max_t(unsigned int, 2, 2) == 3);

/* Compute a * b */
#define log2_mul_t(type, a, b_lg2) ((type)(((type)a) << (b_lg2)))
static_assert(log2_mul_t(unsigned int, 2, 2) == 8);

#define _dispatch_sz(type, fn, a) \
	(sizeof(type) == 4 ? fn##32((u32)a) : fn##64(a))

/*
 * Return the highest value such that:
 *  fls_t(u32, 0) == 0
 *  fls_t(u32, 1) == 1
 *  a >= log2_to_int(ret - 1)
 * aka find last set bit
 */
static inline unsigned int fls32(u32 a)
{
	return fls(a);
}
#define fls_t(type, a) _dispatch_sz(type, fls, a)

/*
 * Return the highest value such that:
 *  ffs_t(u32, 0) == UNDEFINED
 *  ffs_t(u32, 1) == 0
 *  log2_mod(a, ret) == 0
 * aka find first set bit
 */
static inline unsigned int __ffs32(u32 a)
{
	return __ffs(a);
}
#define ffs_t(type, a) _dispatch_sz(type, __ffs, a)

/*
 * Return the highest value such that:
 *  ffz_t(u32, U32_MAX) == UNDEFINED
 *  ffz_t(u32, 0) == 0
 *  ffz_t(u32, 1) == 1
 *  log2_mod(a, ret) == log2_to_max_int(ret)
 * aka find first zero bit
 */
static inline unsigned int ffz32(u32 a)
{
	return ffz(a);
}
static inline unsigned int ffz64(u64 a)
{
	if (sizeof(u64) == sizeof(unsigned long))
		return ffz(a);

	if ((u32)a == U32_MAX)
		return ffz32(a >> 32) + 32;
	return ffz32(a);
}
#define ffz_t(type, a) _dispatch_sz(type, ffz, a)

#endif
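For intuition, a few extra worked values in the same static_assert style the header already uses; these assertions are illustration only and are not part of the patch.

/* A lg2 value of 21 is a 2 MiB item size */
static_assert(log2_to_int_t(u64, 21) == 0x200000);
/* Item index 5 at a level with 2 MiB items is a VA offset of 10 MiB */
static_assert(log2_mul_t(u64, 5, 21) == 0xa00000);
/* 4 KiB page number and page offset of VA 0x201345 */
static_assert(log2_div_t(u64, 0x201345, 12) == 0x201);
static_assert(log2_mod_t(u64, 0x201345, 12) == 0x345);
/* Two VAs inside the same 2 MiB item have equal high bits */
static_assert(log2_div_eq_t(u64, 0x200000, 0x3fffff, 21));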
+5-1
drivers/iommu/intel/Kconfig
···1313 bool "Support for Intel IOMMU using DMA Remapping Devices"1414 depends on PCI_MSI && ACPI && X861515 select IOMMU_API1616+ select GENERIC_PT1717+ select IOMMU_PT1818+ select IOMMU_PT_X86_641919+ select IOMMU_PT_VTDSS1620 select IOMMU_IOVA1721 select IOMMU_IOPF1822 select IOMMUFD_DRIVER if IOMMUFD···70667167config INTEL_IOMMU_FLOPPY_WA7268 def_bool y7373- depends on X866969+ depends on X86 && BLK_DEV_FD7470 help7571 Floppy disk drivers are known to bypass DMA API calls7672 thereby failing to work when IOMMU is enabled. This
+175-756
drivers/iommu/intel/iommu.c
···45454646#define DEFAULT_DOMAIN_ADDRESS_WIDTH 5747474848-#define __DOMAIN_MAX_PFN(gaw) ((((uint64_t)1) << ((gaw) - VTD_PAGE_SHIFT)) - 1)4949-#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << (gaw)) - 1)5050-5151-/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR5252- to match. That way, we can use 'unsigned long' for PFNs with impunity. */5353-#define DOMAIN_MAX_PFN(gaw) ((unsigned long) min_t(uint64_t, \5454- __DOMAIN_MAX_PFN(gaw), (unsigned long)-1))5555-#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << VTD_PAGE_SHIFT)5656-5748static void __init check_tylersburg_isoch(void);4949+static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,5050+ bool enable);5851static int rwbf_quirk;59526053#define rwbf_required(iommu) (rwbf_quirk || cap_rwbf((iommu)->cap))···210217#define IDENTMAP_AZALIA 4211218212219const struct iommu_ops intel_iommu_ops;213213-static const struct iommu_dirty_ops intel_dirty_ops;214220215221static bool translation_pre_enabled(struct intel_iommu *iommu)216222{···277285}278286__setup("intel_iommu=", intel_iommu_setup);279287280280-static int domain_pfn_supported(struct dmar_domain *domain, unsigned long pfn)281281-{282282- int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT;283283-284284- return !(addr_width < BITS_PER_LONG && pfn >> addr_width);285285-}286286-287288/*288289 * Calculate the Supported Adjusted Guest Address Widths of an IOMMU.289290 * Refer to 11.4.2 of the VT-d spec for the encoding of each bit of···336351{337352 return sm_supported(iommu) ?338353 ecap_smpwc(iommu->ecap) : ecap_coherent(iommu->ecap);339339-}340340-341341-/* Return the super pagesize bitmap if supported. */342342-static unsigned long domain_super_pgsize_bitmap(struct dmar_domain *domain)343343-{344344- unsigned long bitmap = 0;345345-346346- /*347347- * 1-level super page supports page size of 2MiB, 2-level super page348348- * supports page size of both 2MiB and 1GiB.349349- */350350- if (domain->iommu_superpage == 1)351351- bitmap |= SZ_2M;352352- else if (domain->iommu_superpage == 2)353353- bitmap |= SZ_2M | SZ_1G;354354-355355- return bitmap;356354}357355358356struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,···524556 return iommu;525557}526558527527-static void domain_flush_cache(struct dmar_domain *domain,528528- void *addr, int size)529529-{530530- if (!domain->iommu_coherency)531531- clflush_cache_range(addr, size);532532-}533533-534559static void free_context_table(struct intel_iommu *iommu)535560{536561 struct context_entry *context;···667706 pgtable_walk(iommu, addr >> VTD_PAGE_SHIFT, bus, devfn, pgtable, level);668707}669708#endif670670-671671-static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,672672- unsigned long pfn, int *target_level,673673- gfp_t gfp)674674-{675675- struct dma_pte *parent, *pte;676676- int level = agaw_to_level(domain->agaw);677677- int offset;678678-679679- if (!domain_pfn_supported(domain, pfn))680680- /* Address beyond IOMMU's addressing capabilities. 
*/681681- return NULL;682682-683683- parent = domain->pgd;684684-685685- while (1) {686686- void *tmp_page;687687-688688- offset = pfn_level_offset(pfn, level);689689- pte = &parent[offset];690690- if (!*target_level && (dma_pte_superpage(pte) || !dma_pte_present(pte)))691691- break;692692- if (level == *target_level)693693- break;694694-695695- if (!dma_pte_present(pte)) {696696- uint64_t pteval, tmp;697697-698698- tmp_page = iommu_alloc_pages_node_sz(domain->nid, gfp,699699- SZ_4K);700700-701701- if (!tmp_page)702702- return NULL;703703-704704- domain_flush_cache(domain, tmp_page, VTD_PAGE_SIZE);705705- pteval = virt_to_phys(tmp_page) | DMA_PTE_READ |706706- DMA_PTE_WRITE;707707- if (domain->use_first_level)708708- pteval |= DMA_FL_PTE_US | DMA_FL_PTE_ACCESS;709709-710710- tmp = 0ULL;711711- if (!try_cmpxchg64(&pte->val, &tmp, pteval))712712- /* Someone else set it while we were thinking; use theirs. */713713- iommu_free_pages(tmp_page);714714- else715715- domain_flush_cache(domain, pte, sizeof(*pte));716716- }717717- if (level == 1)718718- break;719719-720720- parent = phys_to_virt(dma_pte_addr(pte));721721- level--;722722- }723723-724724- if (!*target_level)725725- *target_level = level;726726-727727- return pte;728728-}729729-730730-/* return address's pte at specific level */731731-static struct dma_pte *dma_pfn_level_pte(struct dmar_domain *domain,732732- unsigned long pfn,733733- int level, int *large_page)734734-{735735- struct dma_pte *parent, *pte;736736- int total = agaw_to_level(domain->agaw);737737- int offset;738738-739739- parent = domain->pgd;740740- while (level <= total) {741741- offset = pfn_level_offset(pfn, total);742742- pte = &parent[offset];743743- if (level == total)744744- return pte;745745-746746- if (!dma_pte_present(pte)) {747747- *large_page = total;748748- break;749749- }750750-751751- if (dma_pte_superpage(pte)) {752752- *large_page = total;753753- return pte;754754- }755755-756756- parent = phys_to_virt(dma_pte_addr(pte));757757- total--;758758- }759759- return NULL;760760-}761761-762762-/* clear last level pte, a tlb flush should be followed */763763-static void dma_pte_clear_range(struct dmar_domain *domain,764764- unsigned long start_pfn,765765- unsigned long last_pfn)766766-{767767- unsigned int large_page;768768- struct dma_pte *first_pte, *pte;769769-770770- if (WARN_ON(!domain_pfn_supported(domain, last_pfn)) ||771771- WARN_ON(start_pfn > last_pfn))772772- return;773773-774774- /* we don't need lock here; nobody else touches the iova range */775775- do {776776- large_page = 1;777777- first_pte = pte = dma_pfn_level_pte(domain, start_pfn, 1, &large_page);778778- if (!pte) {779779- start_pfn = align_to_level(start_pfn + 1, large_page + 1);780780- continue;781781- }782782- do {783783- dma_clear_pte(pte);784784- start_pfn += lvl_to_nr_pages(large_page);785785- pte++;786786- } while (start_pfn <= last_pfn && !first_pte_in_page(pte));787787-788788- domain_flush_cache(domain, first_pte,789789- (void *)pte - (void *)first_pte);790790-791791- } while (start_pfn && start_pfn <= last_pfn);792792-}793793-794794-static void dma_pte_free_level(struct dmar_domain *domain, int level,795795- int retain_level, struct dma_pte *pte,796796- unsigned long pfn, unsigned long start_pfn,797797- unsigned long last_pfn)798798-{799799- pfn = max(start_pfn, pfn);800800- pte = &pte[pfn_level_offset(pfn, level)];801801-802802- do {803803- unsigned long level_pfn;804804- struct dma_pte *level_pte;805805-806806- if (!dma_pte_present(pte) || dma_pte_superpage(pte))807807- goto 
next;808808-809809- level_pfn = pfn & level_mask(level);810810- level_pte = phys_to_virt(dma_pte_addr(pte));811811-812812- if (level > 2) {813813- dma_pte_free_level(domain, level - 1, retain_level,814814- level_pte, level_pfn, start_pfn,815815- last_pfn);816816- }817817-818818- /*819819- * Free the page table if we're below the level we want to820820- * retain and the range covers the entire table.821821- */822822- if (level < retain_level && !(start_pfn > level_pfn ||823823- last_pfn < level_pfn + level_size(level) - 1)) {824824- dma_clear_pte(pte);825825- domain_flush_cache(domain, pte, sizeof(*pte));826826- iommu_free_pages(level_pte);827827- }828828-next:829829- pfn += level_size(level);830830- } while (!first_pte_in_page(++pte) && pfn <= last_pfn);831831-}832832-833833-/*834834- * clear last level (leaf) ptes and free page table pages below the835835- * level we wish to keep intact.836836- */837837-static void dma_pte_free_pagetable(struct dmar_domain *domain,838838- unsigned long start_pfn,839839- unsigned long last_pfn,840840- int retain_level)841841-{842842- dma_pte_clear_range(domain, start_pfn, last_pfn);843843-844844- /* We don't need lock here; nobody else touches the iova range */845845- dma_pte_free_level(domain, agaw_to_level(domain->agaw), retain_level,846846- domain->pgd, 0, start_pfn, last_pfn);847847-848848- /* free pgd */849849- if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {850850- iommu_free_pages(domain->pgd);851851- domain->pgd = NULL;852852- }853853-}854854-855855-/* When a page at a given level is being unlinked from its parent, we don't856856- need to *modify* it at all. All we need to do is make a list of all the857857- pages which can be freed just as soon as we've flushed the IOTLB and we858858- know the hardware page-walk will no longer touch them.859859- The 'pte' argument is the *parent* PTE, pointing to the page that is to860860- be freed. */861861-static void dma_pte_list_pagetables(struct dmar_domain *domain,862862- int level, struct dma_pte *parent_pte,863863- struct iommu_pages_list *freelist)864864-{865865- struct dma_pte *pte = phys_to_virt(dma_pte_addr(parent_pte));866866-867867- iommu_pages_list_add(freelist, pte);868868-869869- if (level == 1)870870- return;871871-872872- do {873873- if (dma_pte_present(pte) && !dma_pte_superpage(pte))874874- dma_pte_list_pagetables(domain, level - 1, pte, freelist);875875- pte++;876876- } while (!first_pte_in_page(pte));877877-}878878-879879-static void dma_pte_clear_level(struct dmar_domain *domain, int level,880880- struct dma_pte *pte, unsigned long pfn,881881- unsigned long start_pfn, unsigned long last_pfn,882882- struct iommu_pages_list *freelist)883883-{884884- struct dma_pte *first_pte = NULL, *last_pte = NULL;885885-886886- pfn = max(start_pfn, pfn);887887- pte = &pte[pfn_level_offset(pfn, level)];888888-889889- do {890890- unsigned long level_pfn = pfn & level_mask(level);891891-892892- if (!dma_pte_present(pte))893893- goto next;894894-895895- /* If range covers entire pagetable, free it */896896- if (start_pfn <= level_pfn &&897897- last_pfn >= level_pfn + level_size(level) - 1) {898898- /* These suborbinate page tables are going away entirely. Don't899899- bother to clear them; we're just going to *free* them. 
*/900900- if (level > 1 && !dma_pte_superpage(pte))901901- dma_pte_list_pagetables(domain, level - 1, pte, freelist);902902-903903- dma_clear_pte(pte);904904- if (!first_pte)905905- first_pte = pte;906906- last_pte = pte;907907- } else if (level > 1) {908908- /* Recurse down into a level that isn't *entirely* obsolete */909909- dma_pte_clear_level(domain, level - 1,910910- phys_to_virt(dma_pte_addr(pte)),911911- level_pfn, start_pfn, last_pfn,912912- freelist);913913- }914914-next:915915- pfn = level_pfn + level_size(level);916916- } while (!first_pte_in_page(++pte) && pfn <= last_pfn);917917-918918- if (first_pte)919919- domain_flush_cache(domain, first_pte,920920- (void *)++last_pte - (void *)first_pte);921921-}922922-923923-/* We can't just free the pages because the IOMMU may still be walking924924- the page tables, and may have cached the intermediate levels. The925925- pages can only be freed after the IOTLB flush has been done. */926926-static void domain_unmap(struct dmar_domain *domain, unsigned long start_pfn,927927- unsigned long last_pfn,928928- struct iommu_pages_list *freelist)929929-{930930- if (WARN_ON(!domain_pfn_supported(domain, last_pfn)) ||931931- WARN_ON(start_pfn > last_pfn))932932- return;933933-934934- /* we don't need lock here; nobody else touches the iova range */935935- dma_pte_clear_level(domain, agaw_to_level(domain->agaw),936936- domain->pgd, 0, start_pfn, last_pfn, freelist);937937-938938- /* free pgd */939939- if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {940940- iommu_pages_list_add(freelist, domain->pgd);941941- domain->pgd = NULL;942942- }943943-}944709945710/* iommu handling */946711static int iommu_alloc_root_entry(struct intel_iommu *iommu)···11471460 domain_lookup_dev_info(domain, iommu, bus, devfn);11481461 u16 did = domain_id_iommu(domain, iommu);11491462 int translation = CONTEXT_TT_MULTI_LEVEL;11501150- struct dma_pte *pgd = domain->pgd;14631463+ struct pt_iommu_vtdss_hw_info pt_info;11511464 struct context_entry *context;11521465 int ret;1153146611541467 if (WARN_ON(!intel_domain_is_ss_paging(domain)))11551468 return -EINVAL;14691469+14701470+ pt_iommu_vtdss_hw_info(&domain->sspt, &pt_info);1156147111571472 pr_debug("Set context mapping for %02x:%02x.%d\n",11581473 bus, PCI_SLOT(devfn), PCI_FUNC(devfn));···11781489 else11791490 translation = CONTEXT_TT_MULTI_LEVEL;1180149111811181- context_set_address_root(context, virt_to_phys(pgd));11821182- context_set_address_width(context, domain->agaw);14921492+ context_set_address_root(context, pt_info.ssptptr);14931493+ context_set_address_width(context, pt_info.aw);11831494 context_set_translation_type(context, translation);11841495 context_set_fault_enable(context);11851496 context_set_present(context);···12221533 return ret;1223153412241535 iommu_enable_pci_ats(info);12251225-12261226- return 0;12271227-}12281228-12291229-/* Return largest possible superpage level for a given mapping */12301230-static int hardware_largepage_caps(struct dmar_domain *domain, unsigned long iov_pfn,12311231- unsigned long phy_pfn, unsigned long pages)12321232-{12331233- int support, level = 1;12341234- unsigned long pfnmerge;12351235-12361236- support = domain->iommu_superpage;12371237-12381238- /* To use a large page, the virtual *and* physical addresses12391239- must be aligned to 2MiB/1GiB/etc. Lower bits set in either12401240- of them will mean we have to use smaller pages. So just12411241- merge them and check both at once. 
*/12421242- pfnmerge = iov_pfn | phy_pfn;12431243-12441244- while (support && !(pfnmerge & ~VTD_STRIDE_MASK)) {12451245- pages >>= VTD_STRIDE_SHIFT;12461246- if (!pages)12471247- break;12481248- pfnmerge >>= VTD_STRIDE_SHIFT;12491249- level++;12501250- support--;12511251- }12521252- return level;12531253-}12541254-12551255-/*12561256- * Ensure that old small page tables are removed to make room for superpage(s).12571257- * We're going to add new large pages, so make sure we don't remove their parent12581258- * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared.12591259- */12601260-static void switch_to_super_page(struct dmar_domain *domain,12611261- unsigned long start_pfn,12621262- unsigned long end_pfn, int level)12631263-{12641264- unsigned long lvl_pages = lvl_to_nr_pages(level);12651265- struct dma_pte *pte = NULL;12661266-12671267- if (WARN_ON(!IS_ALIGNED(start_pfn, lvl_pages) ||12681268- !IS_ALIGNED(end_pfn + 1, lvl_pages)))12691269- return;12701270-12711271- while (start_pfn <= end_pfn) {12721272- if (!pte)12731273- pte = pfn_to_dma_pte(domain, start_pfn, &level,12741274- GFP_ATOMIC);12751275-12761276- if (dma_pte_present(pte)) {12771277- dma_pte_free_pagetable(domain, start_pfn,12781278- start_pfn + lvl_pages - 1,12791279- level + 1);12801280-12811281- cache_tag_flush_range(domain, start_pfn << VTD_PAGE_SHIFT,12821282- end_pfn << VTD_PAGE_SHIFT, 0);12831283- }12841284-12851285- pte++;12861286- start_pfn += lvl_pages;12871287- if (first_pte_in_page(pte))12881288- pte = NULL;12891289- }12901290-}12911291-12921292-static int12931293-__domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,12941294- unsigned long phys_pfn, unsigned long nr_pages, int prot,12951295- gfp_t gfp)12961296-{12971297- struct dma_pte *first_pte = NULL, *pte = NULL;12981298- unsigned int largepage_lvl = 0;12991299- unsigned long lvl_pages = 0;13001300- phys_addr_t pteval;13011301- u64 attr;13021302-13031303- if (unlikely(!domain_pfn_supported(domain, iov_pfn + nr_pages - 1)))13041304- return -EINVAL;13051305-13061306- if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)13071307- return -EINVAL;13081308-13091309- if (!(prot & DMA_PTE_WRITE) && domain->nested_parent) {13101310- pr_err_ratelimited("Read-only mapping is disallowed on the domain which serves as the parent in a nested configuration, due to HW errata (ERRATA_772415_SPR17)\n");13111311- return -EINVAL;13121312- }13131313-13141314- attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);13151315- if (domain->use_first_level) {13161316- attr |= DMA_FL_PTE_PRESENT | DMA_FL_PTE_US | DMA_FL_PTE_ACCESS;13171317- if (prot & DMA_PTE_WRITE)13181318- attr |= DMA_FL_PTE_DIRTY;13191319- }13201320-13211321- domain->has_mappings = true;13221322-13231323- pteval = ((phys_addr_t)phys_pfn << VTD_PAGE_SHIFT) | attr;13241324-13251325- while (nr_pages > 0) {13261326- uint64_t tmp;13271327-13281328- if (!pte) {13291329- largepage_lvl = hardware_largepage_caps(domain, iov_pfn,13301330- phys_pfn, nr_pages);13311331-13321332- pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,13331333- gfp);13341334- if (!pte)13351335- return -ENOMEM;13361336- first_pte = pte;13371337-13381338- lvl_pages = lvl_to_nr_pages(largepage_lvl);13391339-13401340- /* It is large page*/13411341- if (largepage_lvl > 1) {13421342- unsigned long end_pfn;13431343- unsigned long pages_to_remove;13441344-13451345- pteval |= DMA_PTE_LARGE_PAGE;13461346- pages_to_remove = min_t(unsigned long,13471347- round_down(nr_pages, lvl_pages),13481348- nr_pte_to_next_page(pte) * 
lvl_pages);13491349- end_pfn = iov_pfn + pages_to_remove - 1;13501350- switch_to_super_page(domain, iov_pfn, end_pfn, largepage_lvl);13511351- } else {13521352- pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE;13531353- }13541354-13551355- }13561356- /* We don't need lock here, nobody else13571357- * touches the iova range13581358- */13591359- tmp = 0ULL;13601360- if (!try_cmpxchg64_local(&pte->val, &tmp, pteval)) {13611361- static int dumps = 5;13621362- pr_crit("ERROR: DMA PTE for vPFN 0x%lx already set (to %llx not %llx)\n",13631363- iov_pfn, tmp, (unsigned long long)pteval);13641364- if (dumps) {13651365- dumps--;13661366- debug_dma_dump_mappings(NULL);13671367- }13681368- WARN_ON(1);13691369- }13701370-13711371- nr_pages -= lvl_pages;13721372- iov_pfn += lvl_pages;13731373- phys_pfn += lvl_pages;13741374- pteval += lvl_pages * VTD_PAGE_SIZE;13751375-13761376- /* If the next PTE would be the first in a new page, then we13771377- * need to flush the cache on the entries we've just written.13781378- * And then we'll need to recalculate 'pte', so clear it and13791379- * let it get set again in the if (!pte) block above.13801380- *13811381- * If we're done (!nr_pages) we need to flush the cache too.13821382- *13831383- * Also if we've been setting superpages, we may need to13841384- * recalculate 'pte' and switch back to smaller pages for the13851385- * end of the mapping, if the trailing size is not enough to13861386- * use another superpage (i.e. nr_pages < lvl_pages).13871387- */13881388- pte++;13891389- if (!nr_pages || first_pte_in_page(pte) ||13901390- (largepage_lvl > 1 && nr_pages < lvl_pages)) {13911391- domain_flush_cache(domain, first_pte,13921392- (void *)pte - (void *)first_pte);13931393- pte = NULL;13941394- }13951395- }1396153613971537 return 0;13981538}···12871769 struct device *dev,12881770 u32 pasid, struct iommu_domain *old)12891771{12901290- struct dma_pte *pgd = domain->pgd;12911291- int level, flags = 0;17721772+ struct pt_iommu_x86_64_hw_info pt_info;17731773+ unsigned int flags = 0;1292177412931293- level = agaw_to_level(domain->agaw);12941294- if (level != 4 && level != 5)17751775+ pt_iommu_x86_64_hw_info(&domain->fspt, &pt_info);17761776+ if (WARN_ON(pt_info.levels != 4 && pt_info.levels != 5))12951777 return -EINVAL;1296177812971297- if (level == 5)17791779+ if (pt_info.levels == 5)12981780 flags |= PASID_FLAG_FL5LP;1299178113001782 if (domain->force_snooping)13011783 flags |= PASID_FLAG_PAGE_SNOOP;1302178417851785+ if (!(domain->fspt.x86_64_pt.common.features &17861786+ BIT(PT_FEAT_DMA_INCOHERENT)))17871787+ flags |= PASID_FLAG_PWSNP;17881788+13031789 return __domain_setup_first_level(iommu, dev, pasid,13041790 domain_id_iommu(domain, iommu),13051305- __pa(pgd), flags, old);17911791+ pt_info.gcr3_pt, flags, old);13061792}1307179313081794static int dmar_domain_attach_device(struct dmar_domain *domain,···27523230}2753323127543232static int blocking_domain_attach_dev(struct iommu_domain *domain,27552755- struct device *dev)32333233+ struct device *dev,32343234+ struct iommu_domain *old)27563235{27573236 struct device_domain_info *info = dev_iommu_priv_get(dev);27583237···27743251 }27753252};2776325327772777-static int iommu_superpage_capability(struct intel_iommu *iommu, bool first_stage)32543254+static struct dmar_domain *paging_domain_alloc(void)27783255{27792779- if (!intel_iommu_superpage)27802780- return 0;27812781-27822782- if (first_stage)27832783- return cap_fl1gp_support(iommu->cap) ? 
2 : 1;27842784-27852785- return fls(cap_super_page_val(iommu->cap));27862786-}27872787-27882788-static struct dmar_domain *paging_domain_alloc(struct device *dev, bool first_stage)27892789-{27902790- struct device_domain_info *info = dev_iommu_priv_get(dev);27912791- struct intel_iommu *iommu = info->iommu;27923256 struct dmar_domain *domain;27932793- int addr_width;2794325727953258 domain = kzalloc(sizeof(*domain), GFP_KERNEL);27963259 if (!domain)···27913282 INIT_LIST_HEAD(&domain->s1_domains);27923283 spin_lock_init(&domain->s1_lock);2793328427942794- domain->nid = dev_to_node(dev);27952795- domain->use_first_level = first_stage;32853285+ return domain;32863286+}2796328727972797- domain->domain.type = IOMMU_DOMAIN_UNMANAGED;27982798-27992799- /* calculate the address width */28002800- addr_width = agaw_to_width(iommu->agaw);28012801- if (addr_width > cap_mgaw(iommu->cap))28022802- addr_width = cap_mgaw(iommu->cap);28032803- domain->gaw = addr_width;28042804- domain->agaw = iommu->agaw;28052805- domain->max_addr = __DOMAIN_MAX_ADDR(addr_width);28062806-28072807- /* iommu memory access coherency */28082808- domain->iommu_coherency = iommu_paging_structure_coherency(iommu);28092809-28102810- /* pagesize bitmap */28112811- domain->domain.pgsize_bitmap = SZ_4K;28122812- domain->iommu_superpage = iommu_superpage_capability(iommu, first_stage);28132813- domain->domain.pgsize_bitmap |= domain_super_pgsize_bitmap(domain);32883288+static unsigned int compute_vasz_lg2_fs(struct intel_iommu *iommu,32893289+ unsigned int *top_level)32903290+{32913291+ unsigned int mgaw = cap_mgaw(iommu->cap);2814329228153293 /*28162816- * IOVA aperture: First-level translation restricts the input-address28172817- * to a canonical address (i.e., address bits 63:N have the same value28182818- * as address bit [N-1], where N is 48-bits with 4-level paging and28192819- * 57-bits with 5-level paging). 
Hence, skip bit [N-1].32943294+ * Spec 3.6 First-Stage Translation:32953295+ *32963296+ * Software must limit addresses to less than the minimum of MGAW32973297+ * and the lower canonical address width implied by FSPM (i.e.,32983298+ * 47-bit when FSPM is 4-level and 56-bit when FSPM is 5-level).28203299 */28212821- domain->domain.geometry.force_aperture = true;28222822- domain->domain.geometry.aperture_start = 0;28232823- if (first_stage)28242824- domain->domain.geometry.aperture_end = __DOMAIN_MAX_ADDR(domain->gaw - 1);28252825- else28262826- domain->domain.geometry.aperture_end = __DOMAIN_MAX_ADDR(domain->gaw);28272827-28282828- /* always allocate the top pgd */28292829- domain->pgd = iommu_alloc_pages_node_sz(domain->nid, GFP_KERNEL, SZ_4K);28302830- if (!domain->pgd) {28312831- kfree(domain);28322832- return ERR_PTR(-ENOMEM);33003300+ if (mgaw > 48 && cap_fl5lp_support(iommu->cap)) {33013301+ *top_level = 4;33023302+ return min(57, mgaw);28333303 }28342834- domain_flush_cache(domain, domain->pgd, PAGE_SIZE);2835330428362836- return domain;33053305+ /* Four level is always supported */33063306+ *top_level = 3;33073307+ return min(48, mgaw);28373308}2838330928393310static struct iommu_domain *28403311intel_iommu_domain_alloc_first_stage(struct device *dev,28413312 struct intel_iommu *iommu, u32 flags)28423313{33143314+ struct pt_iommu_x86_64_cfg cfg = {};28433315 struct dmar_domain *dmar_domain;33163316+ int ret;2844331728453318 if (flags & ~IOMMU_HWPT_ALLOC_PASID)28463319 return ERR_PTR(-EOPNOTSUPP);···28313340 if (!sm_supported(iommu) || !ecap_flts(iommu->ecap))28323341 return ERR_PTR(-EOPNOTSUPP);2833334228342834- dmar_domain = paging_domain_alloc(dev, true);33433343+ dmar_domain = paging_domain_alloc();28353344 if (IS_ERR(dmar_domain))28363345 return ERR_CAST(dmar_domain);2837334633473347+ cfg.common.hw_max_vasz_lg2 =33483348+ compute_vasz_lg2_fs(iommu, &cfg.top_level);33493349+ cfg.common.hw_max_oasz_lg2 = 52;33503350+ cfg.common.features = BIT(PT_FEAT_SIGN_EXTEND) |33513351+ BIT(PT_FEAT_FLUSH_RANGE);33523352+ /* First stage always uses scalable mode */33533353+ if (!ecap_smpwc(iommu->ecap))33543354+ cfg.common.features |= BIT(PT_FEAT_DMA_INCOHERENT);33553355+ dmar_domain->iommu.iommu_device = dev;33563356+ dmar_domain->iommu.nid = dev_to_node(dev);28383357 dmar_domain->domain.ops = &intel_fs_paging_domain_ops;28393358 /*28403359 * iotlb sync for map is only needed for legacy implementations that···28543353 if (rwbf_required(iommu))28553354 dmar_domain->iotlb_sync_map = true;2856335533563356+ ret = pt_iommu_x86_64_init(&dmar_domain->fspt, &cfg, GFP_KERNEL);33573357+ if (ret) {33583358+ kfree(dmar_domain);33593359+ return ERR_PTR(ret);33603360+ }33613361+33623362+ if (!cap_fl1gp_support(iommu->cap))33633363+ dmar_domain->domain.pgsize_bitmap &= ~(u64)SZ_1G;33643364+ if (!intel_iommu_superpage)33653365+ dmar_domain->domain.pgsize_bitmap = SZ_4K;33663366+28573367 return &dmar_domain->domain;28583368}33693369+33703370+static unsigned int compute_vasz_lg2_ss(struct intel_iommu *iommu,33713371+ unsigned int *top_level)33723372+{33733373+ unsigned int sagaw = cap_sagaw(iommu->cap);33743374+ unsigned int mgaw = cap_mgaw(iommu->cap);33753375+33763376+ /*33773377+ * Find the largest table size that both the mgaw and sagaw support.33783378+ * This sets the valid range of IOVA and the top starting level.33793379+ * Some HW may only support a 4 or 5 level walk but must limit IOVA to33803380+ * 3 levels.33813381+ */33823382+ if (mgaw > 48 && sagaw >= BIT(3)) {33833383+ *top_level = 4;33843384+ 
return min(57, mgaw);33853385+ } else if (mgaw > 39 && sagaw >= BIT(2)) {33863386+ *top_level = 3 + ffs(sagaw >> 3);33873387+ return min(48, mgaw);33883388+ } else if (mgaw > 30 && sagaw >= BIT(1)) {33893389+ *top_level = 2 + ffs(sagaw >> 2);33903390+ return min(39, mgaw);33913391+ }33923392+ return 0;33933393+}33943394+33953395+static const struct iommu_dirty_ops intel_second_stage_dirty_ops = {33963396+ IOMMU_PT_DIRTY_OPS(vtdss),33973397+ .set_dirty_tracking = intel_iommu_set_dirty_tracking,33983398+};2859339928603400static struct iommu_domain *28613401intel_iommu_domain_alloc_second_stage(struct device *dev,28623402 struct intel_iommu *iommu, u32 flags)28633403{34043404+ struct pt_iommu_vtdss_cfg cfg = {};28643405 struct dmar_domain *dmar_domain;34063406+ unsigned int sslps;34073407+ int ret;2865340828663409 if (flags &28673410 (~(IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING |···29223377 if (sm_supported(iommu) && !ecap_slts(iommu->ecap))29233378 return ERR_PTR(-EOPNOTSUPP);2924337929252925- dmar_domain = paging_domain_alloc(dev, false);33803380+ dmar_domain = paging_domain_alloc();29263381 if (IS_ERR(dmar_domain))29273382 return ERR_CAST(dmar_domain);2928338333843384+ cfg.common.hw_max_vasz_lg2 = compute_vasz_lg2_ss(iommu, &cfg.top_level);33853385+ cfg.common.hw_max_oasz_lg2 = 52;33863386+ cfg.common.features = BIT(PT_FEAT_FLUSH_RANGE);33873387+33883388+ /*33893389+ * Read-only mapping is disallowed on the domain which serves as the33903390+ * parent in a nested configuration, due to HW errata33913391+ * (ERRATA_772415_SPR17)33923392+ */33933393+ if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT)33943394+ cfg.common.features |= BIT(PT_FEAT_VTDSS_FORCE_WRITEABLE);33953395+33963396+ if (!iommu_paging_structure_coherency(iommu))33973397+ cfg.common.features |= BIT(PT_FEAT_DMA_INCOHERENT);33983398+ dmar_domain->iommu.iommu_device = dev;33993399+ dmar_domain->iommu.nid = dev_to_node(dev);29293400 dmar_domain->domain.ops = &intel_ss_paging_domain_ops;29303401 dmar_domain->nested_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;2931340229323403 if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)29332933- dmar_domain->domain.dirty_ops = &intel_dirty_ops;34043404+ dmar_domain->domain.dirty_ops = &intel_second_stage_dirty_ops;34053405+34063406+ ret = pt_iommu_vtdss_init(&dmar_domain->sspt, &cfg, GFP_KERNEL);34073407+ if (ret) {34083408+ kfree(dmar_domain);34093409+ return ERR_PTR(ret);34103410+ }34113411+34123412+ /* Adjust the supported page sizes to HW capability */34133413+ sslps = cap_super_page_val(iommu->cap);34143414+ if (!(sslps & BIT(0)))34153415+ dmar_domain->domain.pgsize_bitmap &= ~(u64)SZ_2M;34163416+ if (!(sslps & BIT(1)))34173417+ dmar_domain->domain.pgsize_bitmap &= ~(u64)SZ_1G;34183418+ if (!intel_iommu_superpage)34193419+ dmar_domain->domain.pgsize_bitmap = SZ_4K;2934342029353421 /*29363422 * Besides the internal write buffer flush, the caching mode used for···30033427 if (WARN_ON(!list_empty(&dmar_domain->devices)))30043428 return;3005342930063006- if (dmar_domain->pgd) {30073007- struct iommu_pages_list freelist =30083008- IOMMU_PAGES_LIST_INIT(freelist);30093009-30103010- domain_unmap(dmar_domain, 0, DOMAIN_MAX_PFN(dmar_domain->gaw),30113011- &freelist);30123012- iommu_put_pages_list(&freelist);30133013- }34303430+ pt_iommu_deinit(&dmar_domain->iommu);3014343130153432 kfree(dmar_domain->qi_batch);30163433 kfree(dmar_domain);···3018344930193450 /* Only SL is available in legacy mode */30203451 if (!sm_supported(iommu) || !ecap_flts(iommu->ecap))34523452+ return 
-EINVAL;34533453+34543454+ if (!ecap_smpwc(iommu->ecap) &&34553455+ !(dmar_domain->fspt.x86_64_pt.common.features &34563456+ BIT(PT_FEAT_DMA_INCOHERENT)))34573457+ return -EINVAL;34583458+34593459+ /* Supports the number of table levels */34603460+ if (!cap_fl5lp_support(iommu->cap) &&34613461+ dmar_domain->fspt.x86_64_pt.common.max_vasz_lg2 > 48)30213462 return -EINVAL;3022346330233464 /* Same page size support */···30463467paging_domain_compatible_second_stage(struct dmar_domain *dmar_domain,30473468 struct intel_iommu *iommu)30483469{34703470+ unsigned int vasz_lg2 = dmar_domain->sspt.vtdss_pt.common.max_vasz_lg2;30493471 unsigned int sslps = cap_super_page_val(iommu->cap);34723472+ struct pt_iommu_vtdss_hw_info pt_info;34733473+34743474+ pt_iommu_vtdss_hw_info(&dmar_domain->sspt, &pt_info);3050347530513476 if (dmar_domain->domain.dirty_ops && !ssads_supported(iommu))30523477 return -EINVAL;···3059347630603477 /* Legacy mode always supports second stage */30613478 if (sm_supported(iommu) && !ecap_slts(iommu->ecap))34793479+ return -EINVAL;34803480+34813481+ if (!iommu_paging_structure_coherency(iommu) &&34823482+ !(dmar_domain->sspt.vtdss_pt.common.features &34833483+ BIT(PT_FEAT_DMA_INCOHERENT)))34843484+ return -EINVAL;34853485+34863486+ /* Address width falls within the capability */34873487+ if (cap_mgaw(iommu->cap) < vasz_lg2)34883488+ return -EINVAL;34893489+34903490+ /* Page table level is supported. */34913491+ if (!(cap_sagaw(iommu->cap) & BIT(pt_info.aw)))30623492 return -EINVAL;3063349330643494 /* Same page size support */···30853489 !dmar_domain->iotlb_sync_map)30863490 return -EINVAL;3087349134923492+ /*34933493+ * FIXME this is locked wrong, it needs to be under the34943494+ * dmar_domain->lock34953495+ */34963496+ if ((dmar_domain->sspt.vtdss_pt.common.features &34973497+ BIT(PT_FEAT_VTDSS_FORCE_COHERENCE)) &&34983498+ !ecap_sc_support(iommu->ecap))34993499+ return -EINVAL;30883500 return 0;30893501}30903502···31023498 struct dmar_domain *dmar_domain = to_dmar_domain(domain);31033499 struct intel_iommu *iommu = info->iommu;31043500 int ret = -EINVAL;31053105- int addr_width;3106350131073502 if (intel_domain_is_fs_paging(dmar_domain))31083503 ret = paging_domain_compatible_first_stage(dmar_domain, iommu);···31123509 if (ret)31133510 return ret;3114351131153115- /*31163116- * FIXME this is locked wrong, it needs to be under the31173117- * dmar_domain->lock31183118- */31193119- if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))31203120- return -EINVAL;31213121-31223122- if (dmar_domain->iommu_coherency !=31233123- iommu_paging_structure_coherency(iommu))31243124- return -EINVAL;31253125-31263126-31273127- /* check if this iommu agaw is sufficient for max mapped address */31283128- addr_width = agaw_to_width(iommu->agaw);31293129- if (addr_width > cap_mgaw(iommu->cap))31303130- addr_width = cap_mgaw(iommu->cap);31313131-31323132- if (dmar_domain->gaw > addr_width || dmar_domain->agaw > iommu->agaw)31333133- return -EINVAL;31343134-31353512 if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev) &&31363513 context_copied(iommu, info->bus, info->devfn))31373514 return intel_pasid_setup_sm_context(dev);···31203537}3121353831223539static int intel_iommu_attach_device(struct iommu_domain *domain,31233123- struct device *dev)35403540+ struct device *dev,35413541+ struct iommu_domain *old)31243542{31253543 int ret;31263544···31423558 return ret;31433559}3144356031453145-static int intel_iommu_map(struct iommu_domain *domain,31463146- unsigned long iova, phys_addr_t 
hpa,31473147- size_t size, int iommu_prot, gfp_t gfp)31483148-{31493149- struct dmar_domain *dmar_domain = to_dmar_domain(domain);31503150- u64 max_addr;31513151- int prot = 0;31523152-31533153- if (iommu_prot & IOMMU_READ)31543154- prot |= DMA_PTE_READ;31553155- if (iommu_prot & IOMMU_WRITE)31563156- prot |= DMA_PTE_WRITE;31573157- if (dmar_domain->set_pte_snp)31583158- prot |= DMA_PTE_SNP;31593159-31603160- max_addr = iova + size;31613161- if (dmar_domain->max_addr < max_addr) {31623162- u64 end;31633163-31643164- /* check if minimum agaw is sufficient for mapped address */31653165- end = __DOMAIN_MAX_ADDR(dmar_domain->gaw) + 1;31663166- if (end < max_addr) {31673167- pr_err("%s: iommu width (%d) is not "31683168- "sufficient for the mapped address (%llx)\n",31693169- __func__, dmar_domain->gaw, max_addr);31703170- return -EFAULT;31713171- }31723172- dmar_domain->max_addr = max_addr;31733173- }31743174- /* Round up size to next multiple of PAGE_SIZE, if it and31753175- the low bits of hpa would take us onto the next page */31763176- size = aligned_nrpages(hpa, size);31773177- return __domain_mapping(dmar_domain, iova >> VTD_PAGE_SHIFT,31783178- hpa >> VTD_PAGE_SHIFT, size, prot, gfp);31793179-}31803180-31813181-static int intel_iommu_map_pages(struct iommu_domain *domain,31823182- unsigned long iova, phys_addr_t paddr,31833183- size_t pgsize, size_t pgcount,31843184- int prot, gfp_t gfp, size_t *mapped)31853185-{31863186- unsigned long pgshift = __ffs(pgsize);31873187- size_t size = pgcount << pgshift;31883188- int ret;31893189-31903190- if (pgsize != SZ_4K && pgsize != SZ_2M && pgsize != SZ_1G)31913191- return -EINVAL;31923192-31933193- if (!IS_ALIGNED(iova | paddr, pgsize))31943194- return -EINVAL;31953195-31963196- ret = intel_iommu_map(domain, iova, paddr, size, prot, gfp);31973197- if (!ret && mapped)31983198- *mapped = size;31993199-32003200- return ret;32013201-}32023202-32033203-static size_t intel_iommu_unmap(struct iommu_domain *domain,32043204- unsigned long iova, size_t size,32053205- struct iommu_iotlb_gather *gather)32063206-{32073207- struct dmar_domain *dmar_domain = to_dmar_domain(domain);32083208- unsigned long start_pfn, last_pfn;32093209- int level = 0;32103210-32113211- /* Cope with horrid API which requires us to unmap more than the32123212- size argument if it happens to be a large-page mapping. 
*/32133213- if (unlikely(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,32143214- &level, GFP_ATOMIC)))32153215- return 0;32163216-32173217- if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))32183218- size = VTD_PAGE_SIZE << level_to_offset_bits(level);32193219-32203220- start_pfn = iova >> VTD_PAGE_SHIFT;32213221- last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;32223222-32233223- domain_unmap(dmar_domain, start_pfn, last_pfn, &gather->freelist);32243224-32253225- if (dmar_domain->max_addr == iova + size)32263226- dmar_domain->max_addr = iova;32273227-32283228- /*32293229- * We do not use page-selective IOTLB invalidation in flush queue,32303230- * so there is no need to track page and sync iotlb.32313231- */32323232- if (!iommu_iotlb_gather_queued(gather))32333233- iommu_iotlb_gather_add_page(domain, gather, iova, size);32343234-32353235- return size;32363236-}32373237-32383238-static size_t intel_iommu_unmap_pages(struct iommu_domain *domain,32393239- unsigned long iova,32403240- size_t pgsize, size_t pgcount,32413241- struct iommu_iotlb_gather *gather)32423242-{32433243- unsigned long pgshift = __ffs(pgsize);32443244- size_t size = pgcount << pgshift;32453245-32463246- return intel_iommu_unmap(domain, iova, size, gather);32473247-}32483248-32493561static void intel_iommu_tlb_sync(struct iommu_domain *domain,32503562 struct iommu_iotlb_gather *gather)32513563{···31493669 gather->end,31503670 iommu_pages_list_empty(&gather->freelist));31513671 iommu_put_pages_list(&gather->freelist);31523152-}31533153-31543154-static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,31553155- dma_addr_t iova)31563156-{31573157- struct dmar_domain *dmar_domain = to_dmar_domain(domain);31583158- struct dma_pte *pte;31593159- int level = 0;31603160- u64 phys = 0;31613161-31623162- pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,31633163- GFP_ATOMIC);31643164- if (pte && dma_pte_present(pte))31653165- phys = dma_pte_addr(pte) +31663166- (iova & (BIT_MASK(level_to_offset_bits(level) +31673167- VTD_PAGE_SHIFT) - 1));31683168-31693169- return phys;31703672}3171367331723674static bool domain_support_force_snooping(struct dmar_domain *domain)···31923730 struct dmar_domain *dmar_domain = to_dmar_domain(domain);3193373131943732 guard(spinlock_irqsave)(&dmar_domain->lock);31953195- if (!domain_support_force_snooping(dmar_domain) ||31963196- dmar_domain->has_mappings)37333733+ if (!domain_support_force_snooping(dmar_domain))31973734 return false;3198373531993736 /*32003737 * Second level page table supports per-PTE snoop control. The32013738 * iommu_map() interface will handle this by setting SNP bit.32023739 */32033203- dmar_domain->set_pte_snp = true;37403740+ dmar_domain->sspt.vtdss_pt.common.features |=37413741+ BIT(PT_FEAT_VTDSS_FORCE_COHERENCE);32043742 dmar_domain->force_snooping = true;32053743 return true;32063744}···37644302 return ret;37654303}3766430437673767-static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,37683768- unsigned long iova, size_t size,37693769- unsigned long flags,37703770- struct iommu_dirty_bitmap *dirty)37713771-{37723772- struct dmar_domain *dmar_domain = to_dmar_domain(domain);37733773- unsigned long end = iova + size - 1;37743774- unsigned long pgsize;37753775-37763776- /*37773777- * IOMMUFD core calls into a dirty tracking disabled domain without an37783778- * IOVA bitmap set in order to clean dirty bits in all PTEs that might37793779- * have occurred when we stopped dirty tracking. 
This ensures that we37803780- * never inherit dirtied bits from a previous cycle.37813781- */37823782- if (!dmar_domain->dirty_tracking && dirty->bitmap)37833783- return -EINVAL;37843784-37853785- do {37863786- struct dma_pte *pte;37873787- int lvl = 0;37883788-37893789- pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,37903790- GFP_ATOMIC);37913791- pgsize = level_size(lvl) << VTD_PAGE_SHIFT;37923792- if (!pte || !dma_pte_present(pte)) {37933793- iova += pgsize;37943794- continue;37953795- }37963796-37973797- if (dma_sl_pte_test_and_clear_dirty(pte, flags))37983798- iommu_dirty_bitmap_record(dirty, iova, pgsize);37993799- iova += pgsize;38003800- } while (iova < end);38013801-38023802- return 0;38033803-}38043804-38053805-static const struct iommu_dirty_ops intel_dirty_ops = {38063806- .set_dirty_tracking = intel_iommu_set_dirty_tracking,38073807- .read_and_clear_dirty = intel_iommu_read_and_clear_dirty,38083808-};38093809-38104305static int context_setup_pass_through(struct device *dev, u8 bus, u8 devfn)38114306{38124307 struct device_domain_info *info = dev_iommu_priv_get(dev);···38204401 context_setup_pass_through_cb, dev);38214402}3822440338233823-static int identity_domain_attach_dev(struct iommu_domain *domain, struct device *dev)44044404+static int identity_domain_attach_dev(struct iommu_domain *domain,44054405+ struct device *dev,44064406+ struct iommu_domain *old)38244407{38254408 struct device_domain_info *info = dev_iommu_priv_get(dev);38264409 struct intel_iommu *iommu = info->iommu;···38834462};3884446338854464const struct iommu_domain_ops intel_fs_paging_domain_ops = {44654465+ IOMMU_PT_DOMAIN_OPS(x86_64),38864466 .attach_dev = intel_iommu_attach_device,38874467 .set_dev_pasid = intel_iommu_set_dev_pasid,38883888- .map_pages = intel_iommu_map_pages,38893889- .unmap_pages = intel_iommu_unmap_pages,38904468 .iotlb_sync_map = intel_iommu_iotlb_sync_map,38914469 .flush_iotlb_all = intel_flush_iotlb_all,38924470 .iotlb_sync = intel_iommu_tlb_sync,38933893- .iova_to_phys = intel_iommu_iova_to_phys,38944471 .free = intel_iommu_domain_free,38954472 .enforce_cache_coherency = intel_iommu_enforce_cache_coherency_fs,38964473};3897447438984475const struct iommu_domain_ops intel_ss_paging_domain_ops = {44764476+ IOMMU_PT_DOMAIN_OPS(vtdss),38994477 .attach_dev = intel_iommu_attach_device,39004478 .set_dev_pasid = intel_iommu_set_dev_pasid,39013901- .map_pages = intel_iommu_map_pages,39023902- .unmap_pages = intel_iommu_unmap_pages,39034479 .iotlb_sync_map = intel_iommu_iotlb_sync_map,39044480 .flush_iotlb_all = intel_flush_iotlb_all,39054481 .iotlb_sync = intel_iommu_tlb_sync,39063906- .iova_to_phys = intel_iommu_iova_to_phys,39074482 .free = intel_iommu_domain_free,39084483 .enforce_cache_coherency = intel_iommu_enforce_cache_coherency_ss,39094484};···4214479742154798 return ret;42164799}48004800+48014801+MODULE_IMPORT_NS("GENERIC_PT_IOMMU");
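To make the large hunk above easier to follow: both paging domain types now follow the same pattern of sizing a per-format configuration from the capability registers, handing it to the generated pt_iommu_*_init() helper, and then trimming the pgsize_bitmap that the generic code populated down to what the hardware supports. A condensed sketch of the first-stage path, with the capability checks and error handling from the hunks above omitted:

	struct pt_iommu_x86_64_cfg cfg = {};

	/* Pick the 4- vs 5-level walk and the usable VA width from MGAW/FL5LP */
	cfg.common.hw_max_vasz_lg2 = compute_vasz_lg2_fs(iommu, &cfg.top_level);
	cfg.common.hw_max_oasz_lg2 = 52;
	cfg.common.features = BIT(PT_FEAT_SIGN_EXTEND) | BIT(PT_FEAT_FLUSH_RANGE);
	if (!ecap_smpwc(iommu->ecap))
		cfg.common.features |= BIT(PT_FEAT_DMA_INCOHERENT);

	ret = pt_iommu_x86_64_init(&dmar_domain->fspt, &cfg, GFP_KERNEL);

	/* Mask out of the generated pgsize_bitmap what this IOMMU lacks */
	if (!cap_fl1gp_support(iommu->cap))
		dmar_domain->domain.pgsize_bitmap &= ~(u64)SZ_1G;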
+15 -84
drivers/iommu/intel/iommu.h
···2323#include <linux/xarray.h>2424#include <linux/perf_event.h>2525#include <linux/pci.h>2626+#include <linux/generic_pt/iommu.h>26272727-#include <asm/cacheflush.h>2828#include <asm/iommu.h>2929#include <uapi/linux/iommufd.h>3030···595595};596596597597struct dmar_domain {598598- int nid; /* node id */598598+ union {599599+ struct iommu_domain domain;600600+ struct pt_iommu iommu;601601+ /* First stage page table */602602+ struct pt_iommu_x86_64 fspt;603603+ /* Second stage page table */604604+ struct pt_iommu_vtdss sspt;605605+ };606606+599607 struct xarray iommu_array; /* Attached IOMMU array */600608601601- u8 iommu_coherency: 1; /* indicate coherency of iommu access */602602- u8 force_snooping : 1; /* Create IOPTEs with snoop control */603603- u8 set_pte_snp:1;604604- u8 use_first_level:1; /* DMA translation for the domain goes605605- * through the first level page table,606606- * otherwise, goes through the second607607- * level.608608- */609609+ u8 force_snooping:1; /* Create PASID entry with snoop control */609610 u8 dirty_tracking:1; /* Dirty tracking is enabled */610611 u8 nested_parent:1; /* Has other domains nested on it */611611- u8 has_mappings:1; /* Has mappings configured through612612- * iommu_map() interface.613613- */614612 u8 iotlb_sync_map:1; /* Need to flush IOTLB cache or write615613 * buffer when creating mappings.616614 */···621623 struct list_head cache_tags; /* Cache tag list */622624 struct qi_batch *qi_batch; /* Batched QI descriptors */623625624624- int iommu_superpage;/* Level of superpages supported:625625- 0 == 4KiB (no superpages), 1 == 2MiB,626626- 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */627626 union {628627 /* DMA remapping domain */629628 struct {630630- /* virtual address */631631- struct dma_pte *pgd;632632- /* max guest address width */633633- int gaw;634634- /*635635- * adjusted guest address width:636636- * 0: level 2 30-bit637637- * 1: level 3 39-bit638638- * 2: level 4 48-bit639639- * 3: level 5 57-bit640640- */641641- int agaw;642642- /* maximum mapped address */643643- u64 max_addr;644629 /* Protect the s1_domains list */645630 spinlock_t s1_lock;646631 /* Track s1_domains nested on this domain */···645664 struct mmu_notifier notifier;646665 };647666 };648648-649649- struct iommu_domain domain; /* generic domain data structure for650650- iommu core */651667};668668+PT_IOMMU_CHECK_DOMAIN(struct dmar_domain, iommu, domain);669669+PT_IOMMU_CHECK_DOMAIN(struct dmar_domain, sspt.iommu, domain);670670+PT_IOMMU_CHECK_DOMAIN(struct dmar_domain, fspt.iommu, domain);652671653672/*654673 * In theory, the VT-d 4.0 spec can support up to 2 ^ 16 counters.···847866 u64 val;848867};849868850850-static inline void dma_clear_pte(struct dma_pte *pte)851851-{852852- pte->val = 0;853853-}854854-855869static inline u64 dma_pte_addr(struct dma_pte *pte)856870{857871#ifdef CONFIG_64BIT···862886 return (pte->val & 3) != 0;863887}864888865865-static inline bool dma_sl_pte_test_and_clear_dirty(struct dma_pte *pte,866866- unsigned long flags)867867-{868868- if (flags & IOMMU_DIRTY_NO_CLEAR)869869- return (pte->val & DMA_SL_PTE_DIRTY) != 0;870870-871871- return test_and_clear_bit(DMA_SL_PTE_DIRTY_BIT,872872- (unsigned long *)&pte->val);873873-}874874-875889static inline bool dma_pte_superpage(struct dma_pte *pte)876890{877891 return (pte->val & DMA_PTE_LARGE_PAGE);878878-}879879-880880-static inline bool first_pte_in_page(struct dma_pte *pte)881881-{882882- return IS_ALIGNED((unsigned long)pte, VTD_PAGE_SIZE);883883-}884884-885885-static inline int nr_pte_to_next_page(struct 
dma_pte *pte)886886-{887887- return first_pte_in_page(pte) ? BIT_ULL(VTD_STRIDE_SHIFT) :888888- (struct dma_pte *)ALIGN((unsigned long)pte, VTD_PAGE_SIZE) - pte;889892}890893891894static inline bool context_present(struct context_entry *context)···882927 return agaw + 2;883928}884929885885-static inline int agaw_to_width(int agaw)886886-{887887- return min_t(int, 30 + agaw * LEVEL_STRIDE, MAX_AGAW_WIDTH);888888-}889889-890930static inline int width_to_agaw(int width)891931{892932 return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);···897947 return (pfn >> level_to_offset_bits(level)) & LEVEL_MASK;898948}899949900900-static inline u64 level_mask(int level)901901-{902902- return -1ULL << level_to_offset_bits(level);903903-}904904-905905-static inline u64 level_size(int level)906906-{907907- return 1ULL << level_to_offset_bits(level);908908-}909909-910910-static inline u64 align_to_level(u64 pfn, int level)911911-{912912- return (pfn + level_size(level) - 1) & level_mask(level);913913-}914914-915915-static inline unsigned long lvl_to_nr_pages(unsigned int lvl)916916-{917917- return 1UL << min_t(int, (lvl - 1) * LEVEL_STRIDE, MAX_AGAW_PFN_WIDTH);918918-}919950920951static inline void context_set_present(struct context_entry *context)921952{···10281097 struct qi_desc *desc)10291098{10301099 u8 dw = 0, dr = 0;10311031- int ih = 0;11001100+ int ih = addr & 1;1032110110331102 if (cap_write_drain(iommu->cap))10341103 dw = 1;
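The union at the top of struct dmar_domain is what lets a single object be handed to the IOMMU core as an iommu_domain, to the generic PT code as a pt_iommu, and to the per-format helpers as pt_iommu_x86_64 or pt_iommu_vtdss. That only works if the embedded iommu_domain lands at the same offset through every union member, which is what the PT_IOMMU_CHECK_DOMAIN() lines appear to assert at build time. A minimal illustration of the aliasing the driver relies on (sketch, not from the patch; it mirrors what the existing to_dmar_domain() helper does):

	/*
	 * &d->domain, &d->iommu.domain and &d->fspt.iommu.domain all name the
	 * same memory, so a domain pointer handed out by any view of the
	 * union can be converted back to the dmar_domain.
	 */
	static struct dmar_domain *example_to_dmar_domain(struct iommu_domain *dom)
	{
		return container_of(dom, struct dmar_domain, domain);
	}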
+1 -6
drivers/iommu/intel/nested.c
···1919#include "pasid.h"20202121static int intel_nested_attach_dev(struct iommu_domain *domain,2222- struct device *dev)2222+ struct device *dev, struct iommu_domain *old)2323{2424 struct device_domain_info *info = dev_iommu_priv_get(dev);2525 struct dmar_domain *dmar_domain = to_dmar_domain(domain);···2828 int ret = 0;29293030 device_block_translation(dev);3131-3232- if (iommu->agaw < dmar_domain->s2_domain->agaw) {3333- dev_err_ratelimited(dev, "Adjusted guest address width not compatible\n");3434- return -ENODEV;3535- }36313732 /*3833 * Stage-1 domain cannot work alone, it is nested on a s2_domain.
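drivers/iommu/intel/pasid.h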
···24242525#define PASID_FLAG_NESTED BIT(1)2626#define PASID_FLAG_PAGE_SNOOP BIT(2)2727+#define PASID_FLAG_PWSNP BIT(2)27282829/*2930 * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
···44 * Pasha Tatashin <pasha.tatashin@soleen.com>55 */66#include "iommu-pages.h"77+#include <linux/dma-mapping.h>78#include <linux/gfp.h>89#include <linux/mm.h>910···2322#undef IOPTDESC_MATCH2423static_assert(sizeof(struct ioptdesc) <= sizeof(struct page));25242525+static inline size_t ioptdesc_mem_size(struct ioptdesc *desc)2626+{2727+ return 1UL << (folio_order(ioptdesc_folio(desc)) + PAGE_SHIFT);2828+}2929+2630/**2731 * iommu_alloc_pages_node_sz - Allocate a zeroed page of a given size from2832 * specific NUMA node···4236 */4337void *iommu_alloc_pages_node_sz(int nid, gfp_t gfp, size_t size)4438{3939+ struct ioptdesc *iopt;4540 unsigned long pgcnt;4641 struct folio *folio;4742 unsigned int order;···6760 if (unlikely(!folio))6861 return NULL;69626363+ iopt = folio_ioptdesc(folio);6464+ iopt->incoherent = false;6565+7066 /*7167 * All page allocations that should be reported to as "iommu-pagetables"7268 * to userspace must use one of the functions below. This includes···9080static void __iommu_free_desc(struct ioptdesc *iopt)9181{9282 struct folio *folio = ioptdesc_folio(iopt);9393- const unsigned long pgcnt = 1UL << folio_order(folio);8383+ const unsigned long pgcnt = folio_nr_pages(folio);8484+8585+ if (IOMMU_PAGES_USE_DMA_API)8686+ WARN_ON_ONCE(iopt->incoherent);94879588 mod_node_page_state(folio_pgdat(folio), NR_IOMMU_PAGES, -pgcnt);9689 lruvec_stat_mod_folio(folio, NR_SECONDARY_PAGETABLE, -pgcnt);···130117 __iommu_free_desc(iopt);131118}132119EXPORT_SYMBOL_GPL(iommu_put_pages_list);120120+121121+/**122122+ * iommu_pages_start_incoherent - Setup the page for cache incoherent operation123123+ * @virt: The page to setup124124+ * @dma_dev: The iommu device125125+ *126126+ * For incoherent memory this will use the DMA API to manage the cache flushing127127+ * on some arches. This is a lot of complexity compared to just calling128128+ * arch_sync_dma_for_device(), but it is what the existing ARM iommu drivers129129+ * have been doing. The DMA API requires keeping track of the DMA map and130130+ * freeing it when required. This keeps track of the dma map inside the ioptdesc131131+ * so that error paths are simple for the caller.132132+ */133133+int iommu_pages_start_incoherent(void *virt, struct device *dma_dev)134134+{135135+ struct ioptdesc *iopt = virt_to_ioptdesc(virt);136136+ dma_addr_t dma;137137+138138+ if (WARN_ON(iopt->incoherent))139139+ return -EINVAL;140140+141141+ if (!IOMMU_PAGES_USE_DMA_API) {142142+ iommu_pages_flush_incoherent(dma_dev, virt, 0,143143+ ioptdesc_mem_size(iopt));144144+ } else {145145+ dma = dma_map_single(dma_dev, virt, ioptdesc_mem_size(iopt),146146+ DMA_TO_DEVICE);147147+ if (dma_mapping_error(dma_dev, dma))148148+ return -EINVAL;149149+150150+ /*151151+ * The DMA API is not allowed to do anything other than DMA152152+ * direct. 
It would be nice to also check153153+ * dev_is_dma_coherent(dma_dev));154154+ */155155+ if (WARN_ON(dma != virt_to_phys(virt))) {156156+ dma_unmap_single(dma_dev, dma, ioptdesc_mem_size(iopt),157157+ DMA_TO_DEVICE);158158+ return -EOPNOTSUPP;159159+ }160160+ }161161+162162+ iopt->incoherent = 1;163163+ return 0;164164+}165165+EXPORT_SYMBOL_GPL(iommu_pages_start_incoherent);166166+167167+/**168168+ * iommu_pages_start_incoherent_list - Make a list of pages incoherent169169+ * @list: The list of pages to setup170170+ * @dma_dev: The iommu device171171+ *172172+ * Perform iommu_pages_start_incoherent() across all of list.173173+ *174174+ * If this fails the caller must call iommu_pages_stop_incoherent_list().175175+ */176176+int iommu_pages_start_incoherent_list(struct iommu_pages_list *list,177177+ struct device *dma_dev)178178+{179179+ struct ioptdesc *cur;180180+ int ret;181181+182182+ list_for_each_entry(cur, &list->pages, iopt_freelist_elm) {183183+ if (WARN_ON(cur->incoherent))184184+ continue;185185+186186+ ret = iommu_pages_start_incoherent(187187+ folio_address(ioptdesc_folio(cur)), dma_dev);188188+ if (ret)189189+ return ret;190190+ }191191+ return 0;192192+}193193+EXPORT_SYMBOL_GPL(iommu_pages_start_incoherent_list);194194+195195+/**196196+ * iommu_pages_stop_incoherent_list - Undo incoherence across a list197197+ * @list: The list of pages to release198198+ * @dma_dev: The iommu device199199+ *200200+ * Revert iommu_pages_start_incoherent() across all of the list. Pages that did201201+ * not call or succeed iommu_pages_start_incoherent() will be ignored.202202+ */203203+#if IOMMU_PAGES_USE_DMA_API204204+void iommu_pages_stop_incoherent_list(struct iommu_pages_list *list,205205+ struct device *dma_dev)206206+{207207+ struct ioptdesc *cur;208208+209209+ list_for_each_entry(cur, &list->pages, iopt_freelist_elm) {210210+ struct folio *folio = ioptdesc_folio(cur);211211+212212+ if (!cur->incoherent)213213+ continue;214214+ dma_unmap_single(dma_dev, virt_to_phys(folio_address(folio)),215215+ ioptdesc_mem_size(cur), DMA_TO_DEVICE);216216+ cur->incoherent = 0;217217+ }218218+}219219+EXPORT_SYMBOL_GPL(iommu_pages_stop_incoherent_list);220220+221221+/**222222+ * iommu_pages_free_incoherent - Free an incoherent page223223+ * @virt: virtual address of the page to be freed.224224+ * @dma_dev: The iommu device225225+ *226226+ * If the page is incoherent it made coherent again then freed.227227+ */228228+void iommu_pages_free_incoherent(void *virt, struct device *dma_dev)229229+{230230+ struct ioptdesc *iopt = virt_to_ioptdesc(virt);231231+232232+ if (iopt->incoherent) {233233+ dma_unmap_single(dma_dev, virt_to_phys(virt),234234+ ioptdesc_mem_size(iopt), DMA_TO_DEVICE);235235+ iopt->incoherent = 0;236236+ }237237+ __iommu_free_desc(iopt);238238+}239239+EXPORT_SYMBOL_GPL(iommu_pages_free_incoherent);240240+#endif
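A minimal usage sketch for the helpers added above, assuming a hypothetical format implementation that allocates one 4K table at a time (the example_* names and the incoherent flag plumbing are illustrative, not part of the patch):

	static void *example_alloc_table(struct device *iommu_dev, int nid,
					 bool incoherent)
	{
		void *table = iommu_alloc_pages_node_sz(nid, GFP_KERNEL, SZ_4K);

		if (!table)
			return NULL;
		/* Make the zeroed table visible to a non-coherent walker */
		if (incoherent &&
		    iommu_pages_start_incoherent(table, iommu_dev)) {
			iommu_free_pages(table);
			return NULL;
		}
		return table;
	}

	static void example_free_table(void *table, struct device *iommu_dev)
	{
		/* Undoes the dma_map (if any) before returning the memory */
		iommu_pages_free_incoherent(table, iommu_dev);
	}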
+49 -2
drivers/iommu/iommu-pages.h
···21212222 struct list_head iopt_freelist_elm;2323 unsigned long __page_mapping;2424- pgoff_t __index;2424+ union {2525+ u8 incoherent;2626+ pgoff_t __index;2727+ };2528 void *_private;26292730 unsigned int __page_type;···10198 return iommu_alloc_pages_node_sz(NUMA_NO_NODE, gfp, size);10299}103100104104-#endif /* __IOMMU_PAGES_H */101101+int iommu_pages_start_incoherent(void *virt, struct device *dma_dev);102102+int iommu_pages_start_incoherent_list(struct iommu_pages_list *list,103103+ struct device *dma_dev);104104+105105+#ifdef CONFIG_X86106106+#define IOMMU_PAGES_USE_DMA_API 0107107+#include <linux/cacheflush.h>108108+109109+static inline void iommu_pages_flush_incoherent(struct device *dma_dev,110110+ void *virt, size_t offset,111111+ size_t len)112112+{113113+ clflush_cache_range(virt + offset, len);114114+}115115+static inline void116116+iommu_pages_stop_incoherent_list(struct iommu_pages_list *list,117117+ struct device *dma_dev)118118+{119119+ /*120120+ * For performance leave the incoherent flag alone which turns this into121121+ * a NOP. For X86 the rest of the stop/free flow ignores the flag.122122+ */123123+}124124+static inline void iommu_pages_free_incoherent(void *virt,125125+ struct device *dma_dev)126126+{127127+ iommu_free_pages(virt);128128+}129129+#else130130+#define IOMMU_PAGES_USE_DMA_API 1131131+#include <linux/dma-mapping.h>132132+133133+static inline void iommu_pages_flush_incoherent(struct device *dma_dev,134134+ void *virt, size_t offset,135135+ size_t len)136136+{137137+ dma_sync_single_for_device(dma_dev, (uintptr_t)virt + offset, len,138138+ DMA_TO_DEVICE);139139+}140140+void iommu_pages_stop_incoherent_list(struct iommu_pages_list *list,141141+ struct device *dma_dev);142142+void iommu_pages_free_incoherent(void *virt, struct device *dma_dev);143143+#endif144144+145145+#endif /* __IOMMU_PAGES_H */
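To illustrate the split above: with IOMMU_PAGES_USE_DMA_API == 0 (x86) the flush helper is a clflush over just the touched bytes, while on other architectures it becomes dma_sync_single_for_device() on the mapping set up by iommu_pages_start_incoherent(). A hypothetical PTE write path using it (sketch; the PTE layout and names are not from the patch):

	static void example_write_ptes(struct device *dma_dev, u64 *table,
				       unsigned int first, unsigned int nr,
				       u64 first_pte)
	{
		unsigned int i;

		for (i = 0; i != nr; i++)
			WRITE_ONCE(table[first + i], first_pte + i * SZ_4K);

		/* Only the bytes that changed need to reach the HW walker */
		iommu_pages_flush_incoherent(dma_dev, table,
					     first * sizeof(u64),
					     nr * sizeof(u64));
	}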
+31 -13
drivers/iommu/iommu.c
···100100 unsigned long action, void *data);101101static void iommu_release_device(struct device *dev);102102static int __iommu_attach_device(struct iommu_domain *domain,103103- struct device *dev);103103+ struct device *dev, struct iommu_domain *old);104104static int __iommu_attach_group(struct iommu_domain *domain,105105 struct iommu_group *group);106106static struct iommu_domain *__iommu_paging_domain_alloc_flags(struct device *dev,···114114static int __iommu_device_set_domain(struct iommu_group *group,115115 struct device *dev,116116 struct iommu_domain *new_domain,117117+ struct iommu_domain *old_domain,117118 unsigned int flags);118119static int __iommu_group_set_domain_internal(struct iommu_group *group,119120 struct iommu_domain *new_domain,···543542 * Regardless, if a delayed attach never occurred, then the release544543 * should still avoid touching any hardware configuration either.545544 */546546- if (!dev->iommu->attach_deferred && ops->release_domain)547547- ops->release_domain->ops->attach_dev(ops->release_domain, dev);545545+ if (!dev->iommu->attach_deferred && ops->release_domain) {546546+ struct iommu_domain *release_domain = ops->release_domain;547547+548548+ /*549549+ * If the device requires direct mappings then it should not550550+ * be parked on a BLOCKED domain during release as that would551551+ * break the direct mappings.552552+ */553553+ if (dev->iommu->require_direct && ops->identity_domain &&554554+ release_domain == ops->blocked_domain)555555+ release_domain = ops->identity_domain;556556+557557+ release_domain->ops->attach_dev(release_domain, dev,558558+ group->domain);559559+ }548560549561 if (ops->release_device)550562 ops->release_device(dev);···642628 if (group->default_domain)643629 iommu_create_device_direct_mappings(group->default_domain, dev);644630 if (group->domain) {645645- ret = __iommu_device_set_domain(group, dev, group->domain, 0);631631+ ret = __iommu_device_set_domain(group, dev, group->domain, NULL,632632+ 0);646633 if (ret)647634 goto err_remove_gdev;648635 } else if (!group->default_domain && !group_list) {···21302115}2131211621322117static int __iommu_attach_device(struct iommu_domain *domain,21332133- struct device *dev)21182118+ struct device *dev, struct iommu_domain *old)21342119{21352120 int ret;2136212121372122 if (unlikely(domain->ops->attach_dev == NULL))21382123 return -ENODEV;2139212421402140- ret = domain->ops->attach_dev(domain, dev);21252125+ ret = domain->ops->attach_dev(domain, dev, old);21412126 if (ret)21422127 return ret;21432128 dev->iommu->attach_deferred = 0;···21862171int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)21872172{21882173 if (dev->iommu && dev->iommu->attach_deferred)21892189- return __iommu_attach_device(domain, dev);21742174+ return __iommu_attach_device(domain, dev, NULL);2190217521912176 return 0;21922177}···22992284static int __iommu_device_set_domain(struct iommu_group *group,23002285 struct device *dev,23012286 struct iommu_domain *new_domain,22872287+ struct iommu_domain *old_domain,23022288 unsigned int flags)23032289{23042290 int ret;···23252309 dev->iommu->attach_deferred = 0;23262310 }2327231123282328- ret = __iommu_attach_device(new_domain, dev);23122312+ ret = __iommu_attach_device(new_domain, dev, old_domain);23292313 if (ret) {23302314 /*23312315 * If we have a blocking domain then try to attach that in hopes···23352319 if ((flags & IOMMU_SET_DOMAIN_MUST_SUCCEED) &&23362320 group->blocking_domain &&23372321 group->blocking_domain != new_domain)23382338- 
__iommu_attach_device(group->blocking_domain, dev);23222322+ __iommu_attach_device(group->blocking_domain, dev,23232323+ old_domain);23392324 return ret;23402325 }23412326 return 0;···23832366 result = 0;23842367 for_each_group_device(group, gdev) {23852368 ret = __iommu_device_set_domain(group, gdev->dev, new_domain,23862386- flags);23692369+ group->domain, flags);23872370 if (ret) {23882371 result = ret;23892372 /*···24082391 */24092392 last_gdev = gdev;24102393 for_each_group_device(group, gdev) {23942394+ /* No need to revert the last gdev that failed to set domain */23952395+ if (gdev == last_gdev)23962396+ break;24112397 /*24122398 * A NULL domain can happen only for first probe, in which case24132399 * we leave group->domain as NULL and let release clean···24182398 */24192399 if (group->domain)24202400 WARN_ON(__iommu_device_set_domain(24212421- group, gdev->dev, group->domain,24012401+ group, gdev->dev, group->domain, new_domain,24222402 IOMMU_SET_DOMAIN_MUST_SUCCEED));24232423- if (gdev == last_gdev)24242424- break;24252403 }24262404 return ret;24272405}
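The core now passes the previously attached domain into every attach_dev call (NULL when there was none), as the driver conversions later in this series show. Drivers that have no use for it can simply ignore the extra argument; a minimal sketch of the updated op in a hypothetical driver:

	static int example_attach_dev(struct iommu_domain *domain,
				      struct device *dev,
				      struct iommu_domain *old)
	{
		/*
		 * 'old' is the domain the device is currently attached to, or
		 * NULL on the first attach; most drivers do not need it.
		 */
		return 0;
	}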
+1
drivers/iommu/iommufd/Kconfig
···4141 depends on DEBUG_KERNEL4242 depends on FAULT_INJECTION4343 depends on RUNTIME_TESTING_MENU4444+ depends on IOMMU_PT_AMDV14445 select IOMMUFD_DRIVER4546 default n4647 help
+10 -1
drivers/iommu/iommufd/iommufd_test.h
···3232};33333434enum {3535+ MOCK_IOMMUPT_DEFAULT = 0,3636+ MOCK_IOMMUPT_HUGE,3737+ MOCK_IOMMUPT_AMDV1,3838+};3939+4040+/* These values are true for MOCK_IOMMUPT_DEFAULT */4141+enum {3542 MOCK_APERTURE_START = 1UL << 24,3643 MOCK_APERTURE_LAST = (1UL << 31) - 1,4444+ MOCK_PAGE_SIZE = 2048,4545+ MOCK_HUGE_PAGE_SIZE = 512 * MOCK_PAGE_SIZE,3746};38473948enum {···61526253enum {6354 MOCK_FLAGS_DEVICE_NO_DIRTY = 1 << 0,6464- MOCK_FLAGS_DEVICE_HUGE_IOVA = 1 << 1,6555 MOCK_FLAGS_DEVICE_PASID = 1 << 2,6656};6757···213205 */214206struct iommu_hwpt_selftest {215207 __u32 iotlb;208208+ __u32 pagetable_type;216209};217210218211/* Should not be equal to any defined value in enum iommu_hwpt_invalidate_data_type */
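With the new pagetable_type member, the selftests can request a specific mock page table format when allocating a HWPT. A sketch of how the IOMMU_HWPT_DATA_SELFTEST payload might be filled from userspace (the .iotlb value is arbitrary here; only .pagetable_type matters for format selection):

	struct iommu_hwpt_selftest data = {
		.iotlb = 0,
		.pagetable_type = MOCK_IOMMUPT_AMDV1,	/* or _DEFAULT / _HUGE */
	};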
+178 -262
drivers/iommu/iommufd/selftest.c
···1212#include <linux/slab.h>1313#include <linux/xarray.h>1414#include <uapi/linux/iommufd.h>1515+#include <linux/generic_pt/iommu.h>1616+#include "../iommu-pages.h"15171618#include "../iommu-priv.h"1719#include "io_pagetable.h"···43414442enum {4543 MOCK_DIRTY_TRACK = 1,4646- MOCK_IO_PAGE_SIZE = PAGE_SIZE / 2,4747- MOCK_HUGE_PAGE_SIZE = 512 * MOCK_IO_PAGE_SIZE,4848-4949- /*5050- * Like a real page table alignment requires the low bits of the address5151- * to be zero. xarray also requires the high bit to be zero, so we store5252- * the pfns shifted. The upper bits are used for metadata.5353- */5454- MOCK_PFN_MASK = ULONG_MAX / MOCK_IO_PAGE_SIZE,5555-5656- _MOCK_PFN_START = MOCK_PFN_MASK + 1,5757- MOCK_PFN_START_IOVA = _MOCK_PFN_START,5858- MOCK_PFN_LAST_IOVA = _MOCK_PFN_START,5959- MOCK_PFN_DIRTY_IOVA = _MOCK_PFN_START << 1,6060- MOCK_PFN_HUGE_IOVA = _MOCK_PFN_START << 2,6144};62456346static int mock_dev_enable_iopf(struct device *dev, struct iommu_domain *domain);···111124}112125113126struct mock_iommu_domain {127127+ union {128128+ struct iommu_domain domain;129129+ struct pt_iommu iommu;130130+ struct pt_iommu_amdv1 amdv1;131131+ };114132 unsigned long flags;115115- struct iommu_domain domain;116116- struct xarray pfns;117133};134134+PT_IOMMU_CHECK_DOMAIN(struct mock_iommu_domain, iommu, domain);135135+PT_IOMMU_CHECK_DOMAIN(struct mock_iommu_domain, amdv1.iommu, domain);118136119137static inline struct mock_iommu_domain *120138to_mock_domain(struct iommu_domain *domain)···208216}209217210218static int mock_domain_nop_attach(struct iommu_domain *domain,211211- struct device *dev)219219+ struct device *dev, struct iommu_domain *old)212220{213221 struct mock_dev *mdev = to_mock_dev(dev);214222 struct mock_viommu *new_viommu = NULL;···336344 return 0;337345}338346339339-static bool mock_test_and_clear_dirty(struct mock_iommu_domain *mock,340340- unsigned long iova, size_t page_size,341341- unsigned long flags)342342-{343343- unsigned long cur, end = iova + page_size - 1;344344- bool dirty = false;345345- void *ent, *old;346346-347347- for (cur = iova; cur < end; cur += MOCK_IO_PAGE_SIZE) {348348- ent = xa_load(&mock->pfns, cur / MOCK_IO_PAGE_SIZE);349349- if (!ent || !(xa_to_value(ent) & MOCK_PFN_DIRTY_IOVA))350350- continue;351351-352352- dirty = true;353353- /* Clear dirty */354354- if (!(flags & IOMMU_DIRTY_NO_CLEAR)) {355355- unsigned long val;356356-357357- val = xa_to_value(ent) & ~MOCK_PFN_DIRTY_IOVA;358358- old = xa_store(&mock->pfns, cur / MOCK_IO_PAGE_SIZE,359359- xa_mk_value(val), GFP_KERNEL);360360- WARN_ON_ONCE(ent != old);361361- }362362- }363363-364364- return dirty;365365-}366366-367367-static int mock_domain_read_and_clear_dirty(struct iommu_domain *domain,368368- unsigned long iova, size_t size,369369- unsigned long flags,370370- struct iommu_dirty_bitmap *dirty)371371-{372372- struct mock_iommu_domain *mock = to_mock_domain(domain);373373- unsigned long end = iova + size;374374- void *ent;375375-376376- if (!(mock->flags & MOCK_DIRTY_TRACK) && dirty->bitmap)377377- return -EINVAL;378378-379379- do {380380- unsigned long pgsize = MOCK_IO_PAGE_SIZE;381381- unsigned long head;382382-383383- ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);384384- if (!ent) {385385- iova += pgsize;386386- continue;387387- }388388-389389- if (xa_to_value(ent) & MOCK_PFN_HUGE_IOVA)390390- pgsize = MOCK_HUGE_PAGE_SIZE;391391- head = iova & ~(pgsize - 1);392392-393393- /* Clear dirty */394394- if (mock_test_and_clear_dirty(mock, head, pgsize, flags))395395- iommu_dirty_bitmap_record(dirty, 
iova, pgsize);396396- iova += pgsize;397397- } while (iova < end);398398-399399- return 0;400400-}401401-402402-static const struct iommu_dirty_ops dirty_ops = {403403- .set_dirty_tracking = mock_domain_set_dirty_tracking,404404- .read_and_clear_dirty = mock_domain_read_and_clear_dirty,405405-};406406-407347static struct mock_iommu_domain_nested *408348__mock_domain_alloc_nested(const struct iommu_user_data *user_data)409349{···370446371447 if (flags & ~IOMMU_HWPT_ALLOC_PASID)372448 return ERR_PTR(-EOPNOTSUPP);373373- if (!parent || parent->ops != mock_ops.default_domain_ops)449449+ if (!parent || !(parent->type & __IOMMU_DOMAIN_PAGING))374450 return ERR_PTR(-EINVAL);375451376452 mock_parent = to_mock_domain(parent);···383459 return &mock_nested->domain;384460}385461462462+static void mock_domain_free(struct iommu_domain *domain)463463+{464464+ struct mock_iommu_domain *mock = to_mock_domain(domain);465465+466466+ pt_iommu_deinit(&mock->iommu);467467+ kfree(mock);468468+}469469+470470+static void mock_iotlb_sync(struct iommu_domain *domain,471471+ struct iommu_iotlb_gather *gather)472472+{473473+ iommu_put_pages_list(&gather->freelist);474474+}475475+476476+static const struct iommu_domain_ops amdv1_mock_ops = {477477+ IOMMU_PT_DOMAIN_OPS(amdv1_mock),478478+ .free = mock_domain_free,479479+ .attach_dev = mock_domain_nop_attach,480480+ .set_dev_pasid = mock_domain_set_dev_pasid_nop,481481+ .iotlb_sync = &mock_iotlb_sync,482482+};483483+484484+static const struct iommu_domain_ops amdv1_mock_huge_ops = {485485+ IOMMU_PT_DOMAIN_OPS(amdv1_mock),486486+ .free = mock_domain_free,487487+ .attach_dev = mock_domain_nop_attach,488488+ .set_dev_pasid = mock_domain_set_dev_pasid_nop,489489+ .iotlb_sync = &mock_iotlb_sync,490490+};491491+#undef pt_iommu_amdv1_mock_map_pages492492+493493+static const struct iommu_dirty_ops amdv1_mock_dirty_ops = {494494+ IOMMU_PT_DIRTY_OPS(amdv1_mock),495495+ .set_dirty_tracking = mock_domain_set_dirty_tracking,496496+};497497+498498+static const struct iommu_domain_ops amdv1_ops = {499499+ IOMMU_PT_DOMAIN_OPS(amdv1),500500+ .free = mock_domain_free,501501+ .attach_dev = mock_domain_nop_attach,502502+ .set_dev_pasid = mock_domain_set_dev_pasid_nop,503503+ .iotlb_sync = &mock_iotlb_sync,504504+};505505+506506+static const struct iommu_dirty_ops amdv1_dirty_ops = {507507+ IOMMU_PT_DIRTY_OPS(amdv1),508508+ .set_dirty_tracking = mock_domain_set_dirty_tracking,509509+};510510+511511+static struct mock_iommu_domain *512512+mock_domain_alloc_pgtable(struct device *dev,513513+ const struct iommu_hwpt_selftest *user_cfg, u32 flags)514514+{515515+ struct mock_iommu_domain *mock;516516+ int rc;517517+518518+ mock = kzalloc(sizeof(*mock), GFP_KERNEL);519519+ if (!mock)520520+ return ERR_PTR(-ENOMEM);521521+ mock->domain.type = IOMMU_DOMAIN_UNMANAGED;522522+523523+ mock->amdv1.iommu.nid = NUMA_NO_NODE;524524+525525+ switch (user_cfg->pagetable_type) {526526+ case MOCK_IOMMUPT_DEFAULT:527527+ case MOCK_IOMMUPT_HUGE: {528528+ struct pt_iommu_amdv1_cfg cfg = {};529529+530530+ /* The mock version has a 2k page size */531531+ cfg.common.hw_max_vasz_lg2 = 56;532532+ cfg.common.hw_max_oasz_lg2 = 51;533533+ cfg.starting_level = 2;534534+ if (user_cfg->pagetable_type == MOCK_IOMMUPT_HUGE)535535+ mock->domain.ops = &amdv1_mock_huge_ops;536536+ else537537+ mock->domain.ops = &amdv1_mock_ops;538538+ rc = pt_iommu_amdv1_mock_init(&mock->amdv1, &cfg, GFP_KERNEL);539539+ if (rc)540540+ goto err_free;541541+542542+ /*543543+ * In huge mode userspace should only provide huge pages, we544544+ * have 
to include PAGE_SIZE for the domain to be accepted by545545+ * iommufd.546546+ */547547+ if (user_cfg->pagetable_type == MOCK_IOMMUPT_HUGE)548548+ mock->domain.pgsize_bitmap = MOCK_HUGE_PAGE_SIZE |549549+ PAGE_SIZE;550550+ if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)551551+ mock->domain.dirty_ops = &amdv1_mock_dirty_ops;552552+ break;553553+ }554554+555555+ case MOCK_IOMMUPT_AMDV1: {556556+ struct pt_iommu_amdv1_cfg cfg = {};557557+558558+ cfg.common.hw_max_vasz_lg2 = 64;559559+ cfg.common.hw_max_oasz_lg2 = 52;560560+ cfg.common.features = BIT(PT_FEAT_DYNAMIC_TOP) |561561+ BIT(PT_FEAT_AMDV1_ENCRYPT_TABLES) |562562+ BIT(PT_FEAT_AMDV1_FORCE_COHERENCE);563563+ cfg.starting_level = 2;564564+ mock->domain.ops = &amdv1_ops;565565+ rc = pt_iommu_amdv1_init(&mock->amdv1, &cfg, GFP_KERNEL);566566+ if (rc)567567+ goto err_free;568568+ if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)569569+ mock->domain.dirty_ops = &amdv1_dirty_ops;570570+ break;571571+ }572572+ default:573573+ rc = -EOPNOTSUPP;574574+ goto err_free;575575+ }576576+577577+ /*578578+ * Override the real aperture to the MOCK aperture for test purposes.579579+ */580580+ if (user_cfg->pagetable_type == MOCK_IOMMUPT_DEFAULT) {581581+ WARN_ON(mock->domain.geometry.aperture_start != 0);582582+ WARN_ON(mock->domain.geometry.aperture_end < MOCK_APERTURE_LAST);583583+584584+ mock->domain.geometry.aperture_start = MOCK_APERTURE_START;585585+ mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;586586+ }587587+588588+ return mock;589589+err_free:590590+ kfree(mock);591591+ return ERR_PTR(rc);592592+}593593+386594static struct iommu_domain *387595mock_domain_alloc_paging_flags(struct device *dev, u32 flags,388596 const struct iommu_user_data *user_data)···525469 IOMMU_HWPT_ALLOC_PASID;526470 struct mock_dev *mdev = to_mock_dev(dev);527471 bool no_dirty_ops = mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY;472472+ struct iommu_hwpt_selftest user_cfg = {};528473 struct mock_iommu_domain *mock;474474+ int rc;529475530530- if (user_data)531531- return ERR_PTR(-EOPNOTSUPP);532476 if ((flags & ~PAGING_FLAGS) || (has_dirty_flag && no_dirty_ops))533477 return ERR_PTR(-EOPNOTSUPP);534478535535- mock = kzalloc(sizeof(*mock), GFP_KERNEL);536536- if (!mock)537537- return ERR_PTR(-ENOMEM);538538- mock->domain.geometry.aperture_start = MOCK_APERTURE_START;539539- mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;540540- mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;541541- if (dev && mdev->flags & MOCK_FLAGS_DEVICE_HUGE_IOVA)542542- mock->domain.pgsize_bitmap |= MOCK_HUGE_PAGE_SIZE;543543- mock->domain.ops = mock_ops.default_domain_ops;544544- mock->domain.type = IOMMU_DOMAIN_UNMANAGED;545545- xa_init(&mock->pfns);479479+ if (user_data && (user_data->type != IOMMU_HWPT_DATA_SELFTEST &&480480+ user_data->type != IOMMU_HWPT_DATA_NONE))481481+ return ERR_PTR(-EOPNOTSUPP);546482547547- if (has_dirty_flag)548548- mock->domain.dirty_ops = &dirty_ops;483483+ if (user_data) {484484+ rc = iommu_copy_struct_from_user(485485+ &user_cfg, user_data, IOMMU_HWPT_DATA_SELFTEST, iotlb);486486+ if (rc)487487+ return ERR_PTR(rc);488488+ }489489+490490+ mock = mock_domain_alloc_pgtable(dev, &user_cfg, flags);491491+ if (IS_ERR(mock))492492+ return ERR_CAST(mock);549493 return &mock->domain;550550-}551551-552552-static void mock_domain_free(struct iommu_domain *domain)553553-{554554- struct mock_iommu_domain *mock = to_mock_domain(domain);555555-556556- WARN_ON(!xa_empty(&mock->pfns));557557- kfree(mock);558558-}559559-560560-static int mock_domain_map_pages(struct iommu_domain 
*domain,561561- unsigned long iova, phys_addr_t paddr,562562- size_t pgsize, size_t pgcount, int prot,563563- gfp_t gfp, size_t *mapped)564564-{565565- struct mock_iommu_domain *mock = to_mock_domain(domain);566566- unsigned long flags = MOCK_PFN_START_IOVA;567567- unsigned long start_iova = iova;568568-569569- /*570570- * xarray does not reliably work with fault injection because it does a571571- * retry allocation, so put our own failure point.572572- */573573- if (iommufd_should_fail())574574- return -ENOENT;575575-576576- WARN_ON(iova % MOCK_IO_PAGE_SIZE);577577- WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);578578- for (; pgcount; pgcount--) {579579- size_t cur;580580-581581- for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {582582- void *old;583583-584584- if (pgcount == 1 && cur + MOCK_IO_PAGE_SIZE == pgsize)585585- flags = MOCK_PFN_LAST_IOVA;586586- if (pgsize != MOCK_IO_PAGE_SIZE) {587587- flags |= MOCK_PFN_HUGE_IOVA;588588- }589589- old = xa_store(&mock->pfns, iova / MOCK_IO_PAGE_SIZE,590590- xa_mk_value((paddr / MOCK_IO_PAGE_SIZE) |591591- flags),592592- gfp);593593- if (xa_is_err(old)) {594594- for (; start_iova != iova;595595- start_iova += MOCK_IO_PAGE_SIZE)596596- xa_erase(&mock->pfns,597597- start_iova /598598- MOCK_IO_PAGE_SIZE);599599- return xa_err(old);600600- }601601- WARN_ON(old);602602- iova += MOCK_IO_PAGE_SIZE;603603- paddr += MOCK_IO_PAGE_SIZE;604604- *mapped += MOCK_IO_PAGE_SIZE;605605- flags = 0;606606- }607607- }608608- return 0;609609-}610610-611611-static size_t mock_domain_unmap_pages(struct iommu_domain *domain,612612- unsigned long iova, size_t pgsize,613613- size_t pgcount,614614- struct iommu_iotlb_gather *iotlb_gather)615615-{616616- struct mock_iommu_domain *mock = to_mock_domain(domain);617617- bool first = true;618618- size_t ret = 0;619619- void *ent;620620-621621- WARN_ON(iova % MOCK_IO_PAGE_SIZE);622622- WARN_ON(pgsize % MOCK_IO_PAGE_SIZE);623623-624624- for (; pgcount; pgcount--) {625625- size_t cur;626626-627627- for (cur = 0; cur != pgsize; cur += MOCK_IO_PAGE_SIZE) {628628- ent = xa_erase(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);629629-630630- /*631631- * iommufd generates unmaps that must be a strict632632- * superset of the map's performend So every633633- * starting/ending IOVA should have been an iova passed634634- * to map.635635- *636636- * This simple logic doesn't work when the HUGE_PAGE is637637- * turned on since the core code will automatically638638- * switch between the two page sizes creating a break in639639- * the unmap calls. 
The break can land in the middle of640640- * contiguous IOVA.641641- */642642- if (!(domain->pgsize_bitmap & MOCK_HUGE_PAGE_SIZE)) {643643- if (first) {644644- WARN_ON(ent && !(xa_to_value(ent) &645645- MOCK_PFN_START_IOVA));646646- first = false;647647- }648648- if (pgcount == 1 &&649649- cur + MOCK_IO_PAGE_SIZE == pgsize)650650- WARN_ON(ent && !(xa_to_value(ent) &651651- MOCK_PFN_LAST_IOVA));652652- }653653-654654- iova += MOCK_IO_PAGE_SIZE;655655- ret += MOCK_IO_PAGE_SIZE;656656- }657657- }658658- return ret;659659-}660660-661661-static phys_addr_t mock_domain_iova_to_phys(struct iommu_domain *domain,662662- dma_addr_t iova)663663-{664664- struct mock_iommu_domain *mock = to_mock_domain(domain);665665- void *ent;666666-667667- WARN_ON(iova % MOCK_IO_PAGE_SIZE);668668- ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);669669- WARN_ON(!ent);670670- return (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE;671494}672495673496static bool mock_domain_capable(struct device *dev, enum iommu_cap cap)···890955 .user_pasid_table = true,891956 .get_viommu_size = mock_get_viommu_size,892957 .viommu_init = mock_viommu_init,893893- .default_domain_ops =894894- &(struct iommu_domain_ops){895895- .free = mock_domain_free,896896- .attach_dev = mock_domain_nop_attach,897897- .map_pages = mock_domain_map_pages,898898- .unmap_pages = mock_domain_unmap_pages,899899- .iova_to_phys = mock_domain_iova_to_phys,900900- .set_dev_pasid = mock_domain_set_dev_pasid_nop,901901- },902958};903959904960static void mock_domain_free_nested(struct iommu_domain *domain)···9731047 if (IS_ERR(hwpt))9741048 return hwpt;9751049 if (hwpt->domain->type != IOMMU_DOMAIN_UNMANAGED ||976976- hwpt->domain->ops != mock_ops.default_domain_ops) {10501050+ hwpt->domain->owner != &mock_ops) {9771051 iommufd_put_object(ucmd->ictx, &hwpt->obj);9781052 return ERR_PTR(-EINVAL);9791053 }···10141088 {},10151089 };10161090 const u32 valid_flags = MOCK_FLAGS_DEVICE_NO_DIRTY |10171017- MOCK_FLAGS_DEVICE_HUGE_IOVA |10181091 MOCK_FLAGS_DEVICE_PASID;10191092 struct mock_dev *mdev;10201093 int rc, i;···12021277{12031278 struct iommufd_hw_pagetable *hwpt;12041279 struct mock_iommu_domain *mock;12801280+ unsigned int page_size;12051281 uintptr_t end;12061282 int rc;12071207-12081208- if (iova % MOCK_IO_PAGE_SIZE || length % MOCK_IO_PAGE_SIZE ||12091209- (uintptr_t)uptr % MOCK_IO_PAGE_SIZE ||12101210- check_add_overflow((uintptr_t)uptr, (uintptr_t)length, &end))12111211- return -EINVAL;1212128312131284 hwpt = get_md_pagetable(ucmd, mockpt_id, &mock);12141285 if (IS_ERR(hwpt))12151286 return PTR_ERR(hwpt);1216128712171217- for (; length; length -= MOCK_IO_PAGE_SIZE) {12881288+ page_size = 1 << __ffs(mock->domain.pgsize_bitmap);12891289+ if (iova % page_size || length % page_size ||12901290+ (uintptr_t)uptr % page_size ||12911291+ check_add_overflow((uintptr_t)uptr, (uintptr_t)length, &end))12921292+ return -EINVAL;12931293+12941294+ for (; length; length -= page_size) {12181295 struct page *pages[1];12961296+ phys_addr_t io_phys;12191297 unsigned long pfn;12201298 long npages;12211221- void *ent;1222129912231300 npages = get_user_pages_fast((uintptr_t)uptr & PAGE_MASK, 1, 0,12241301 pages);···12351308 pfn = page_to_pfn(pages[0]);12361309 put_page(pages[0]);1237131012381238- ent = xa_load(&mock->pfns, iova / MOCK_IO_PAGE_SIZE);12391239- if (!ent ||12401240- (xa_to_value(ent) & MOCK_PFN_MASK) * MOCK_IO_PAGE_SIZE !=12411241- pfn * PAGE_SIZE + ((uintptr_t)uptr % PAGE_SIZE)) {13111311+ io_phys = mock->domain.ops->iova_to_phys(&mock->domain, 
iova);13121312+ if (io_phys !=13131313+ pfn * PAGE_SIZE + ((uintptr_t)uptr % PAGE_SIZE)) {12421314 rc = -EINVAL;12431315 goto out_put;12441316 }12451245- iova += MOCK_IO_PAGE_SIZE;12461246- uptr += MOCK_IO_PAGE_SIZE;13171317+ iova += page_size;13181318+ uptr += page_size;12471319 }12481320 rc = 0;12491321···17211795 if (IS_ERR(hwpt))17221796 return PTR_ERR(hwpt);1723179717241724- if (!(mock->flags & MOCK_DIRTY_TRACK)) {17981798+ if (!(mock->flags & MOCK_DIRTY_TRACK) || !mock->iommu.ops->set_dirty) {17251799 rc = -EINVAL;17261800 goto out_put;17271801 }···17401814 }1741181517421816 for (i = 0; i < max; i++) {17431743- unsigned long cur = iova + i * page_size;17441744- void *ent, *old;17451745-17461817 if (!test_bit(i, (unsigned long *)tmp))17471818 continue;17481748-17491749- ent = xa_load(&mock->pfns, cur / page_size);17501750- if (ent) {17511751- unsigned long val;17521752-17531753- val = xa_to_value(ent) | MOCK_PFN_DIRTY_IOVA;17541754- old = xa_store(&mock->pfns, cur / page_size,17551755- xa_mk_value(val), GFP_KERNEL);17561756- WARN_ON_ONCE(ent != old);17571757- count++;17581758- }18191819+ mock->iommu.ops->set_dirty(&mock->iommu, iova + i * page_size);18201820+ count++;17591821 }1760182217611823 cmd->dirty.out_nr_dirty = count;···21162202 platform_device_unregister(selftest_iommu_dev);21172203 debugfs_remove_recursive(dbgfs_root);21182204}22052205+22062206+MODULE_IMPORT_NS("GENERIC_PT_IOMMU");
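drivers/iommu/omap-iommu.h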
···8888/**8989 * struct omap_iommu_arch_data - omap iommu private data9090 * @iommu_dev: handle of the OMAP iommu device9191- * @dev: handle of the iommu device9291 *9392 * This is an omap iommu private data object, which binds an iommu user9493 * to its iommu device. This object should be placed at the iommu user's···9697 */9798struct omap_iommu_arch_data {9899 struct omap_iommu *iommu_dev;9999- struct device *dev;100100};101101102102struct cr_regs {
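drivers/iommu/rockchip-iommu.c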
···960960}961961962962static int rk_iommu_identity_attach(struct iommu_domain *identity_domain,963963- struct device *dev)963963+ struct device *dev,964964+ struct iommu_domain *old)964965{965966 struct rk_iommu *iommu;966967 struct rk_iommu_domain *rk_domain;···10061005};1007100610081007static int rk_iommu_attach_device(struct iommu_domain *domain,10091009- struct device *dev)10081008+ struct device *dev, struct iommu_domain *old)10101009{10111010 struct rk_iommu *iommu;10121011 struct rk_iommu_domain *rk_domain = to_rk_domain(domain);···10271026 if (iommu->domain == domain)10281027 return 0;1029102810301030- ret = rk_iommu_identity_attach(&rk_identity_domain, dev);10291029+ ret = rk_iommu_identity_attach(&rk_identity_domain, dev, old);10311030 if (ret)10321031 return ret;10331032···10421041 return 0;1043104210441043 ret = rk_iommu_enable(iommu);10451045- if (ret)10461046- WARN_ON(rk_iommu_identity_attach(&rk_identity_domain, dev));10441044+ if (ret) {10451045+ /*10461046+ * Note rk_iommu_identity_attach() might fail before physically10471047+ * attaching the dev to iommu->domain, in which case the actual10481048+ * old domain for this revert should be rk_identity_domain v.s.10491049+ * iommu->domain. Since rk_iommu_identity_attach() does not care10501050+ * about the old domain argument for now, this is not a problem.10511051+ */10521052+ WARN_ON(rk_iommu_identity_attach(&rk_identity_domain, dev,10531053+ iommu->domain));10541054+ }1047105510481056 pm_runtime_put(iommu->dev);10491057
+8 -5
drivers/iommu/s390-iommu.c
···670670}671671672672static int blocking_domain_attach_device(struct iommu_domain *domain,673673- struct device *dev)673673+ struct device *dev,674674+ struct iommu_domain *old)674675{675676 struct zpci_dev *zdev = to_zpci_dev(dev);676677 struct s390_domain *s390_domain;···695694}696695697696static int s390_iommu_attach_device(struct iommu_domain *domain,698698- struct device *dev)697697+ struct device *dev,698698+ struct iommu_domain *old)699699{700700 struct s390_domain *s390_domain = to_s390_domain(domain);701701 struct zpci_dev *zdev = to_zpci_dev(dev);···711709 domain->geometry.aperture_end < zdev->start_dma))712710 return -EINVAL;713711714714- blocking_domain_attach_device(&blocking_domain, dev);712712+ blocking_domain_attach_device(&blocking_domain, dev, old);715713716714 /* If we fail now DMA remains blocked via blocking domain */717715 cc = s390_iommu_domain_reg_ioat(zdev, domain, &status);···11331131subsys_initcall(s390_iommu_init);1134113211351133static int s390_attach_dev_identity(struct iommu_domain *domain,11361136- struct device *dev)11341134+ struct device *dev,11351135+ struct iommu_domain *old)11371136{11381137 struct zpci_dev *zdev = to_zpci_dev(dev);11391138 u8 status;11401139 int cc;1141114011421142- blocking_domain_attach_device(&blocking_domain, dev);11411141+ blocking_domain_attach_device(&blocking_domain, dev, old);1143114211441143 /* If we fail now DMA remains blocked via blocking domain */11451144 cc = s390_iommu_domain_reg_ioat(zdev, domain, &status);
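include/linux/generic_pt/common.h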
···11+/* SPDX-License-Identifier: GPL-2.0-only */22+/*33+ * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES44+ */55+#ifndef __GENERIC_PT_COMMON_H66+#define __GENERIC_PT_COMMON_H77+88+#include <linux/types.h>99+#include <linux/build_bug.h>1010+#include <linux/bits.h>1111+1212+/**1313+ * DOC: Generic Radix Page Table1414+ *1515+ * Generic Radix Page Table is a set of functions and helpers to efficiently1616+ * parse radix style page tables typically seen in HW implementations. The1717+ * interface is built to deliver similar code generation as the mm's pte/pmd/etc1818+ * system by fully inlining the exact code required to handle each table level.1919+ *2020+ * Like the mm subsystem each format contributes its parsing implementation2121+ * under common names and the common code implements the required algorithms.2222+ *2323+ * The system is divided into three logical levels:2424+ *2525+ * - The page table format and its manipulation functions2626+ * - Generic helpers to give a consistent API regardless of underlying format2727+ * - An algorithm implementation (e.g. IOMMU/DRM/KVM/MM)2828+ *2929+ * Multiple implementations are supported. The intention is to have the generic3030+ * format code be re-usable for whatever specialized implementation is required.3131+ * The generic code is solely about the format of the radix tree; it does not3232+ * include memory allocation or higher level decisions that are left for the3333+ * implementation.3434+ *3535+ * The generic framework supports a superset of functions across many HW3636+ * implementations:3737+ *3838+ * - Entries comprised of contiguous blocks of IO PTEs for larger page sizes3939+ * - Multi-level tables, up to 6 levels. Runtime selected top level4040+ * - Runtime variable table level size (ARM's concatenated tables)4141+ * - Expandable top level allowing dynamic sizing of table levels4242+ * - Optional leaf entries at any level4343+ * - 32-bit/64-bit virtual and output addresses, using every address bit4444+ * - Dirty tracking4545+ * - Sign extended addressing4646+ */4747+4848+/**4949+ * struct pt_common - struct for all page table implementations5050+ */5151+struct pt_common {5252+ /**5353+ * @top_of_table: Encodes the table top pointer and the top level in a5454+ * single value. Must use READ_ONCE/WRITE_ONCE to access it. The lower5555+ * bits of the aligned table pointer are used for the level.5656+ */5757+ uintptr_t top_of_table;5858+ /**5959+ * @max_oasz_lg2: Maximum number of bits the OA can contain. Upper bits6060+ * must be zero. This may be less than what the page table format6161+ * supports, but must not be more.6262+ */6363+ u8 max_oasz_lg2;6464+ /**6565+ * @max_vasz_lg2: Maximum number of bits the VA can contain. Upper bits6666+ * are 0 or 1 depending on pt_full_va_prefix(). This may be less than6767+ * what the page table format supports, but must not be more. When6868+ * PT_FEAT_DYNAMIC_TOP is set this reflects the maximum VA capability.6969+ */7070+ u8 max_vasz_lg2;7171+ /**7272+ * @features: Bitmap of `enum pt_features`7373+ */7474+ unsigned int features;7575+};7676+7777+/* Encoding parameters for top_of_table */7878+enum {7979+ PT_TOP_LEVEL_BITS = 3,8080+ PT_TOP_LEVEL_MASK = GENMASK(PT_TOP_LEVEL_BITS - 1, 0),8181+};8282+8383+/**8484+ * enum pt_features - Features turned on in the table. Each symbol is a bit8585+ * position.8686+ */8787+enum pt_features {8888+ /**8989+ * @PT_FEAT_DMA_INCOHERENT: Cache flush page table memory before9090+ * assuming the HW can read it. 
Otherwise a SMP release is sufficient9191+ * for HW to read it.9292+ */9393+ PT_FEAT_DMA_INCOHERENT,9494+ /**9595+ * @PT_FEAT_FULL_VA: The table can span the full VA range from 0 to9696+ * PT_VADDR_MAX.9797+ */9898+ PT_FEAT_FULL_VA,9999+ /**100100+ * @PT_FEAT_DYNAMIC_TOP: The table's top level can be increased101101+ * dynamically during map. This requires HW support for atomically102102+ * setting both the table top pointer and the starting table level.103103+ */104104+ PT_FEAT_DYNAMIC_TOP,105105+ /**106106+ * @PT_FEAT_SIGN_EXTEND: The top most bit of the valid VA range sign107107+ * extends up to the full pt_vaddr_t. This divides the page table into108108+ * three VA ranges::109109+ *110110+ * 0 -> 2^N - 1 Lower111111+ * 2^N -> (MAX - 2^N - 1) Non-Canonical112112+ * MAX - 2^N -> MAX Upper113113+ *114114+ * In this mode pt_common::max_vasz_lg2 includes the sign bit and the115115+ * upper bits that don't fall within the translation are just validated.116116+ *117117+ * If not set there is no sign extension and valid VA goes from 0 to 2^N118118+ * - 1.119119+ */120120+ PT_FEAT_SIGN_EXTEND,121121+ /**122122+ * @PT_FEAT_FLUSH_RANGE: IOTLB maintenance is done by flushing IOVA123123+ * ranges which will clean out any walk cache or any IOPTE fully124124+ * contained by the range. The optimization objective is to minimize the125125+ * number of flushes even if ranges include IOVA gaps that do not need126126+ * to be flushed.127127+ */128128+ PT_FEAT_FLUSH_RANGE,129129+ /**130130+ * @PT_FEAT_FLUSH_RANGE_NO_GAPS: Like PT_FEAT_FLUSH_RANGE except that131131+ * the optimization objective is to only flush IOVA that has been132132+ * changed. This mode is suitable for cases like hypervisor shadowing133133+ * where flushing unchanged ranges may cause the hypervisor to reparse134134+ * significant amount of page table.135135+ */136136+ PT_FEAT_FLUSH_RANGE_NO_GAPS,137137+ /* private: */138138+ PT_FEAT_FMT_START,139139+};140140+141141+struct pt_amdv1 {142142+ struct pt_common common;143143+};144144+145145+enum {146146+ /*147147+ * The memory backing the tables is encrypted. Use __sme_set() to adjust148148+ * the page table pointers in the tree. This only works with149149+ * CONFIG_AMD_MEM_ENCRYPT.150150+ */151151+ PT_FEAT_AMDV1_ENCRYPT_TABLES = PT_FEAT_FMT_START,152152+ /*153153+ * The PTEs are set to prevent cache incoherent traffic, such as PCI no154154+ * snoop. This is set either at creation time or before the first map155155+ * operation.156156+ */157157+ PT_FEAT_AMDV1_FORCE_COHERENCE,158158+};159159+160160+struct pt_vtdss {161161+ struct pt_common common;162162+};163163+164164+enum {165165+ /*166166+ * The PTEs are set to prevent cache incoherent traffic, such as PCI no167167+ * snoop. This is set either at creation time or before the first map168168+ * operation.169169+ */170170+ PT_FEAT_VTDSS_FORCE_COHERENCE = PT_FEAT_FMT_START,171171+ /*172172+ * Prevent creating read-only PTEs. Used to work around HW errata173173+ * ERRATA_772415_SPR17.174174+ */175175+ PT_FEAT_VTDSS_FORCE_WRITEABLE,176176+};177177+178178+struct pt_x86_64 {179179+ struct pt_common common;180180+};181181+182182+enum {183183+ /*184184+ * The memory backing the tables is encrypted. Use __sme_set() to adjust185185+ * the page table pointers in the tree. This only works with186186+ * CONFIG_AMD_MEM_ENCRYPT.187187+ */188188+ PT_FEAT_X86_64_AMD_ENCRYPT_TABLES = PT_FEAT_FMT_START,189189+};190190+191191+#endif
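Because the table pointer and the starting level are packed into the single top_of_table word, a reader can pick up a consistent pointer/level pair with one READ_ONCE() even while PT_FEAT_DYNAMIC_TOP grows the table. A sketch of the decoding implied by the kernel-doc above (illustrative; the real accessors live in the generic PT headers, not in this hunk):

	static inline void example_read_top(const struct pt_common *common,
					    uintptr_t *top, unsigned int *level)
	{
		uintptr_t v = READ_ONCE(common->top_of_table);

		*level = v & PT_TOP_LEVEL_MASK;
		*top = v & ~(uintptr_t)PT_TOP_LEVEL_MASK; /* aligned table pointer */
	}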
+293
include/linux/generic_pt/iommu.h
/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
 */
#ifndef __GENERIC_PT_IOMMU_H
#define __GENERIC_PT_IOMMU_H

#include <linux/generic_pt/common.h>
#include <linux/iommu.h>
#include <linux/mm_types.h>

struct iommu_iotlb_gather;
struct pt_iommu_ops;
struct pt_iommu_driver_ops;
struct iommu_dirty_bitmap;

/**
 * DOC: IOMMU Radix Page Table
 *
 * The IOMMU implementation of the Generic Page Table provides an ops struct
 * that is useful to go with an iommu_domain to serve the DMA API, IOMMUFD and
 * the generic map/unmap interface.
 *
 * This interface uses a caller provided locking approach. The caller must have
 * a VA range lock concept that prevents concurrent threads from calling ops on
 * the same VA. Generally the range lock must be at least as large as a single
 * map call.
 */

/**
 * struct pt_iommu - Base structure for IOMMU page tables
 *
 * The format-specific struct will include this as the first member.
 */
struct pt_iommu {
	/**
	 * @domain: The core IOMMU domain. The driver should use a union to
	 * overlay this memory with its previously existing domain struct to
	 * create an alias.
	 */
	struct iommu_domain domain;

	/**
	 * @ops: Function pointers to access the API
	 */
	const struct pt_iommu_ops *ops;

	/**
	 * @driver_ops: Function pointers provided by the HW driver to help
	 * manage HW details like caches.
	 */
	const struct pt_iommu_driver_ops *driver_ops;

	/**
	 * @nid: Node ID to use for table memory allocations. The IOMMU driver
	 * may want to set the NID to the device's NID, if there are multiple
	 * table walkers.
	 */
	int nid;

	/**
	 * @iommu_device: Device pointer used for any DMA cache flushing when
	 * PT_FEAT_DMA_INCOHERENT is set. This is the iommu device that created
	 * the page table; it must have DMA ops that perform cache flushing.
	 */
	struct device *iommu_device;
};

/**
 * struct pt_iommu_info - Details about the IOMMU page table
 *
 * Returned from pt_iommu_ops->get_info()
 */
struct pt_iommu_info {
	/**
	 * @pgsize_bitmap: A bitmask where each set bit indicates
	 * a page size that can be natively stored in the page table.
	 */
	u64 pgsize_bitmap;
};

struct pt_iommu_ops {
	/**
	 * @set_dirty: Make the iova write dirty
	 * @iommu_table: Table to manipulate
	 * @iova: IO virtual address to start
	 *
	 * This is only used by iommufd testing. It makes the iova dirty so
	 * that read_and_clear_dirty() will see it as dirty. Unlike all the
	 * other ops this one is safe to call without holding any locks. It may
	 * return -EAGAIN if there is a race.
	 */
	int (*set_dirty)(struct pt_iommu *iommu_table, dma_addr_t iova);

	/**
	 * @get_info: Return the pt_iommu_info structure
	 * @iommu_table: Table to query
	 *
	 * Return some basic static information about the page table.
	 */
	void (*get_info)(struct pt_iommu *iommu_table,
			 struct pt_iommu_info *info);

	/**
	 * @deinit: Undo a format specific init operation
	 * @iommu_table: Table to destroy
	 *
	 * Release all of the memory. The caller must have already removed the
	 * table from all HW access and all caches.
	 */
	void (*deinit)(struct pt_iommu *iommu_table);
};

/**
 * struct pt_iommu_driver_ops - HW IOTLB cache flushing operations
 *
 * The IOMMU driver should implement these using container_of(iommu_table) to
 * get to its iommu_domain derived structure. All ops can be called in atomic
 * contexts as they are buried under DMA API calls.
 */
struct pt_iommu_driver_ops {
	/**
	 * @change_top: Update the top of table pointer
	 * @iommu_table: Table to operate on
	 * @top_paddr: New CPU physical address of the top pointer
	 * @top_level: IOMMU PT level of the new top
	 *
	 * Called under the get_top_lock() spinlock. The driver must update all
	 * HW references to this domain with a new top address and
	 * configuration. On return, mappings placed in the new top must be
	 * reachable by the HW.
	 *
	 * top_level encodes the level in IOMMU PT format: level 0 is the
	 * smallest page size, increasing from there. This has to be translated
	 * to any HW specific format. During this call the new top will not be
	 * visible to any other API.
	 *
	 * This op is only used by PT_FEAT_DYNAMIC_TOP, and is required if
	 * enabled.
	 */
	void (*change_top)(struct pt_iommu *iommu_table, phys_addr_t top_paddr,
			   unsigned int top_level);

	/**
	 * @get_top_lock: Lock to hold when changing the table top
	 * @iommu_table: Table to operate on
	 *
	 * Return the lock to hold when changing the table top stored in HW.
	 * The lock will be held prior to calling change_top() and released
	 * once the top is fully visible.
	 *
	 * Typically this would be a lock that protects the iommu_domain's
	 * attachment list.
	 *
	 * This op is only used by PT_FEAT_DYNAMIC_TOP, and is required if
	 * enabled.
	 */
	spinlock_t *(*get_top_lock)(struct pt_iommu *iommu_table);
};

static inline void pt_iommu_deinit(struct pt_iommu *iommu_table)
{
	/*
	 * It is safe to call pt_iommu_deinit() before an init, or if init
	 * fails. The ops pointer will only become non-NULL if deinit needs to
	 * be run.
	 */
	if (iommu_table->ops)
		iommu_table->ops->deinit(iommu_table);
}

/**
 * struct pt_iommu_cfg - Common configuration values for all formats
 */
struct pt_iommu_cfg {
	/**
	 * @features: Features required. Only these features will be turned on.
	 * The feature list should reflect what the IOMMU HW is capable of.
	 */
	unsigned int features;
	/**
	 * @hw_max_vasz_lg2: Maximum VA the IOMMU HW can support. This implies
	 * the top level of the table.
	 */
	u8 hw_max_vasz_lg2;
	/**
	 * @hw_max_oasz_lg2: Maximum OA the IOMMU HW can support. The format
	 * might select a lower maximum OA.
	 */
	u8 hw_max_oasz_lg2;
};

/* Generate the exported function signatures from iommu_pt.h */
#define IOMMU_PROTOTYPES(fmt)                                                  \
	phys_addr_t pt_iommu_##fmt##_iova_to_phys(struct iommu_domain *domain, \
						  dma_addr_t iova);            \
	int pt_iommu_##fmt##_map_pages(struct iommu_domain *domain,            \
				       unsigned long iova, phys_addr_t paddr,  \
				       size_t pgsize, size_t pgcount,          \
				       int prot, gfp_t gfp, size_t *mapped);   \
	size_t pt_iommu_##fmt##_unmap_pages(                                   \
		struct iommu_domain *domain, unsigned long iova,               \
		size_t pgsize, size_t pgcount,                                 \
		struct iommu_iotlb_gather *iotlb_gather);                      \
	int pt_iommu_##fmt##_read_and_clear_dirty(                             \
		struct iommu_domain *domain, unsigned long iova, size_t size,  \
		unsigned long flags, struct iommu_dirty_bitmap *dirty);        \
	int pt_iommu_##fmt##_init(struct pt_iommu_##fmt *table,                \
				  const struct pt_iommu_##fmt##_cfg *cfg,      \
				  gfp_t gfp);                                  \
	void pt_iommu_##fmt##_hw_info(struct pt_iommu_##fmt *table,            \
				      struct pt_iommu_##fmt##_hw_info *info)

#define IOMMU_FORMAT(fmt, member)                                              \
	struct pt_iommu_##fmt {                                                \
		struct pt_iommu iommu;                                         \
		struct pt_##fmt member;                                        \
	};                                                                     \
	IOMMU_PROTOTYPES(fmt)

/*
 * A driver uses IOMMU_PT_DOMAIN_OPS to populate the iommu_domain_ops for the
 * iommu_pt
 */
#define IOMMU_PT_DOMAIN_OPS(fmt)                                               \
	.iova_to_phys = &pt_iommu_##fmt##_iova_to_phys,                        \
	.map_pages = &pt_iommu_##fmt##_map_pages,                              \
	.unmap_pages = &pt_iommu_##fmt##_unmap_pages

#define IOMMU_PT_DIRTY_OPS(fmt)                                                \
	.read_and_clear_dirty = &pt_iommu_##fmt##_read_and_clear_dirty

/*
 * The driver should set up its domain struct like:
 *	union {
 *		struct iommu_domain domain;
 *		struct pt_iommu_xxx xx;
 *	};
 *	PT_IOMMU_CHECK_DOMAIN(struct mock_iommu_domain, xx.iommu, domain);
 *
 * This creates an alias between driver_domain.domain and
 * driver_domain.xx.iommu.domain. It avoids a mass rename of existing
 * driver_domain.domain users.
 */
#define PT_IOMMU_CHECK_DOMAIN(s, pt_iommu_memb, domain_memb)                   \
	static_assert(offsetof(s, pt_iommu_memb.domain) ==                     \
		      offsetof(s, domain_memb))

struct pt_iommu_amdv1_cfg {
	struct pt_iommu_cfg common;
	unsigned int starting_level;
};

struct pt_iommu_amdv1_hw_info {
	u64 host_pt_root;
	u8 mode;
};

IOMMU_FORMAT(amdv1, amdpt);

/* amdv1_mock is used by the iommufd selftest */
#define pt_iommu_amdv1_mock pt_iommu_amdv1
#define pt_iommu_amdv1_mock_cfg pt_iommu_amdv1_cfg
struct pt_iommu_amdv1_mock_hw_info;
IOMMU_PROTOTYPES(amdv1_mock);

struct pt_iommu_vtdss_cfg {
	struct pt_iommu_cfg common;
	/* 4 is a 57 bit 5 level table */
	unsigned int top_level;
};

struct pt_iommu_vtdss_hw_info {
	u64 ssptptr;
	u8 aw;
};

IOMMU_FORMAT(vtdss, vtdss_pt);

struct pt_iommu_x86_64_cfg {
	struct pt_iommu_cfg common;
	/* 4 is a 57 bit 5 level table */
	unsigned int top_level;
};

struct pt_iommu_x86_64_hw_info {
	u64 gcr3_pt;
	u8 levels;
};

IOMMU_FORMAT(x86_64, x86_64_pt);

#undef IOMMU_PROTOTYPES
#undef IOMMU_FORMAT
#endif
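Putting the pieces of iommu.h together, a driver overlays its existing domain
struct with the per-format pt_iommu struct and lets the macros generate the
iommu_domain_ops entries. The following minimal sketch wires in the AMDv1
format; struct my_iommu_domain, my_domain_ops, my_domain_init() and the numeric
HW limits are hypothetical, only the types, macros and pt_iommu_amdv1_init()
come from the header above::

    /* Hypothetical driver glue built on the declarations in iommu.h. */
    #include <linux/device.h>
    #include <linux/generic_pt/iommu.h>

    struct my_iommu_domain {
            union {
                    struct iommu_domain domain;  /* name kept for existing users */
                    struct pt_iommu_amdv1 amdv1; /* generic_pt view of the same memory */
            };
    };
    /* Verify the two views of the embedded iommu_domain alias exactly */
    PT_IOMMU_CHECK_DOMAIN(struct my_iommu_domain, amdv1.iommu, domain);

    static const struct iommu_domain_ops my_domain_ops = {
            IOMMU_PT_DOMAIN_OPS(amdv1),
            /* .attach_dev, .free, etc. remain driver specific */
    };

    static int my_domain_init(struct my_iommu_domain *dom, struct device *iommu_dev)
    {
            struct pt_iommu_amdv1_cfg cfg = {
                    .common = {
                            .features = BIT(PT_FEAT_DMA_INCOHERENT),
                            .hw_max_vasz_lg2 = 48, /* illustrative HW limits */
                            .hw_max_oasz_lg2 = 52,
                    },
                    .starting_level = 2, /* illustrative */
            };

            /* Assumed here: the driver fills these before calling init */
            dom->amdv1.iommu.nid = dev_to_node(iommu_dev);
            dom->amdv1.iommu.iommu_device = iommu_dev;
            return pt_iommu_amdv1_init(&dom->amdv1, &cfg, GFP_KERNEL);
    }

Because the ops symbols are generated per format by IOMMU_PROTOTYPES(), a
driver that supports several formats simply builds one such ops struct per
format and selects between them at domain allocation time.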