[PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Nicolin Chen 3 months, 1 week ago
The CMDQV HW supports user-space use for virtualization cases. It allows the
VM to issue guest-level TLBI or ATC_INV commands directly to the queue and
have them executed without a VMEXIT, as the HW replaces the VMID field in a
TLBI command and the SID field in an ATC_INV command with the preset VMID
and SID.

This is built upon the vIOMMU infrastructure, allowing the VMM to allocate a
VINTF (as a vIOMMU object) and assign VCMDQs (HW QUEUE objects) to the VINTF.

First, replace the standard vSMMU model with the VINTF implementation, while
reusing the standard cache_invalidate op (for unsupported commands) and the
standard alloc_domain_nested op (for the standard nested STE).

Each VINTF has two 64KB MMIO pages (128B per logical VCMDQ):
 - Page0 (directly accessed by guest) has all the control and status bits.
 - Page1 (trapped by VMM) has guest-owned queue memory location/size info.

The VMM should trap writes to the guest VM's emulated VINTF0 page1 to get the
guest-level VCMDQ location/size info and forward that to the kernel, which
translates it to a physical memory location to program the VCMDQ HW during an
allocation call. Then, it should mmap the assigned VINTF's page0 to the VINTF0
page0 of the guest VM. This allows the guest OS to read and write the
guest-owned VINTF's page0 for direct control of the VCMDQ HW.
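
Below is a rough VMM-side sketch of this flow, for illustration only
(error handling omitted; the ioctl wrapper and the hw-queue field names
follow the iommufd uapi added earlier in this series, so treat the exact
names as assumptions rather than as part of this patch):

  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/iommufd.h>

  /*
   * On a trapped write to the emulated VINTF0 page1, translate the guest
   * queue base/size and allocate the HW VCMDQ at @lidx accordingly.
   */
  static void vmm_alloc_vcmdq(int iommufd, __u32 viommu_id, __u32 lidx,
                              __u64 queue_iova, __u64 queue_len)
  {
          struct iommu_hw_queue_alloc cmd = {
                  .size = sizeof(cmd),
                  .viommu_id = viommu_id,
                  .type = IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV,
                  .index = lidx,
                  /* Must be mapped in the IOAS and aligned to @queue_len */
                  .nesting_parent_iova = queue_iova,
                  .length = queue_len,
          };

          ioctl(iommufd, IOMMU_HW_QUEUE_ALLOC, &cmd);
  }

  /*
   * After IOMMU_VIOMMU_ALLOC (type IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV) has
   * filled @data, map the VINTF page0 for the guest to access directly.
   */
  static void *vmm_map_vintf_page0(int iommufd,
                                   struct iommu_viommu_tegra241_cmdqv *data)
  {
          return mmap(NULL, data->out_vintf_mmap_length,
                      PROT_READ | PROT_WRITE, MAP_SHARED, iommufd,
                      data->out_vintf_mmap_offset);
  }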

For ATC invalidation commands that hold an SID, the HW requires all devices to
register their virtual SIDs in the SID_MATCH registers and their physical SIDs
in the pairing SID_REPLACE registers, so that the HW can use these as a lookup
table to replace the virtual SIDs with the correct physical SIDs. Thus,
implement the driver-allocated vDEVICE op with a tegra241_vintf_sid structure
to allocate a SID_REPLACE entry and to program the SIDs accordingly.
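
A similar sketch for the vSID registration below: the VMM allocates one
vDEVICE object per passthrough device with @virt_id set to the vSID
(again, the field names per the existing IOMMU_VDEVICE_ALLOC uapi are an
assumption here, not part of this patch):

  /*
   * Bind the device's guest-visible (virtual) SID to its physical SID;
   * the driver's vdevice_init op programs one free SID_MATCH/SID_REPLACE
   * register pair from this.
   */
  static void vmm_register_vsid(int iommufd, __u32 viommu_id, __u32 dev_id,
                                __u64 vsid)
  {
          struct iommu_vdevice_alloc cmd = {
                  .size = sizeof(cmd),
                  .viommu_id = viommu_id,
                  .dev_id = dev_id,
                  .virt_id = vsid,
          };

          ioctl(iommufd, IOMMU_VDEVICE_ALLOC, &cmd);
  }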

This enables the HW-accelerated feature on the NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in nested translation mode and trapping the
CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions in invalidation time were measured by various DMA unmap
tests running in a guest OS.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   7 +
 include/uapi/linux/iommufd.h                  |  58 +++
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     |   6 +-
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c    | 407 +++++++++++++++++-
 4 files changed, 471 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 836d5556008e..aa25156e04a3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1056,10 +1056,17 @@ int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
 void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state);
 void arm_smmu_master_clear_vmaster(struct arm_smmu_master *master);
 int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt);
+struct iommu_domain *
+arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
+			      const struct iommu_user_data *user_data);
+int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
+			       struct iommu_user_data_array *array);
 #else
 #define arm_smmu_get_viommu_size NULL
 #define arm_smmu_hw_info NULL
 #define arm_vsmmu_init NULL
+#define arm_vsmmu_alloc_domain_nested NULL
+#define arm_vsmmu_cache_invalidate NULL
 
 static inline int
 arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 6ae9d2102154..1c9e486113e3 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -591,6 +591,27 @@ struct iommu_hw_info_arm_smmuv3 {
 	__u32 aidr;
 };
 
+/**
+ * struct iommu_hw_info_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Hardware Information
+ *                                (IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)
+ * @flags: Must be 0
+ * @version: Version number for the CMDQ-V HW for PARAM bits[03:00]
+ * @log2vcmdqs: Log2 of the total number of VCMDQs for PARAM bits[07:04]
+ * @log2vsids: Log2 of the total number of SID replacements for PARAM bits[15:12]
+ * @__reserved: Must be 0
+ *
+ * The VMM can use these fields directly in its emulated global PARAM register.
+ * Note that only one Virtual Interface (VINTF) should be exposed to a VM, i.e.
+ * PARAM bits[11:08] (log2 of the total number of VINTFs) should be set to 0.
+ */
+struct iommu_hw_info_tegra241_cmdqv {
+	__u32 flags;
+	__u8 version;
+	__u8 log2vcmdqs;
+	__u8 log2vsids;
+	__u8 __reserved;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
@@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
  * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
  * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
  * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
+ * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ *                                     SMMUv3) info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE = 0,
 	IOMMU_HW_INFO_TYPE_DEFAULT = 0,
 	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
 	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
+	IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV = 3,
 };
 
 /**
@@ -972,10 +996,29 @@ struct iommu_fault_alloc {
  * enum iommu_viommu_type - Virtual IOMMU Type
  * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use
  * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type
+ * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ *                                    SMMUv3) Virtual Interface (VINTF)
  */
 enum iommu_viommu_type {
 	IOMMU_VIOMMU_TYPE_DEFAULT = 0,
 	IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1,
+	IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2,
+};
+
+/**
+ * struct iommu_viommu_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Virtual Interface
+ *                                      (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV)
+ * @out_vintf_mmap_offset: mmap offset argument for VINTF's page0
+ * @out_vintf_mmap_length: mmap length argument for VINTF's page0
+ *
+ * Both @out_vintf_mmap_offset and @out_vintf_mmap_length are reported by the
+ * kernel for user space to mmap the VINTF page0 from the host physical address
+ * space into the guest physical address space, so that a guest kernel can have
+ * direct R/W access to the VINTF page0 to control its virtual command queues.
+ */
+struct iommu_viommu_tegra241_cmdqv {
+	__aligned_u64 out_vintf_mmap_offset;
+	__aligned_u64 out_vintf_mmap_length;
 };
 
 /**
@@ -1172,9 +1215,24 @@ struct iommu_veventq_alloc {
 /**
  * enum iommu_hw_queue_type - HW Queue Type
  * @IOMMU_HW_QUEUE_TYPE_DEFAULT: Reserved for future use
+ * @IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ *                                      SMMUv3) Virtual Command Queue (VCMDQ)
  */
 enum iommu_hw_queue_type {
 	IOMMU_HW_QUEUE_TYPE_DEFAULT = 0,
+	/*
+	 * TEGRA241_CMDQV requirements (otherwise, allocation will fail)
+	 * - alloc starts from the lowest @index=0 in ascending order
+	 * - destroy starts from the last allocated @index in descending order
+	 * - @base_addr must be aligned to @length in bytes and mapped in IOAS
+	 * - @length must be a power of 2, with a minimum of 32 bytes and a
+	 *   maximum of 2 ^ idr[1].CMDQS * 16 bytes (use the GET_HW_INFO call to
+	 *   read idr[1] from struct iommu_hw_info_arm_smmuv3)
+	 * - it is suggested to back the queue memory with contiguous physical
+	 *   pages or a single huge page aligned to the queue size, and to limit
+	 *   the emulated vSMMU's IDR1.CMDQS to log2(huge page size / 16 bytes)
+	 */
+	IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV = 1,
 };
 
 /**
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 1cf9646e776f..d9bea8f1f636 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -225,7 +225,7 @@ static int arm_smmu_validate_vste(struct iommu_hwpt_arm_smmuv3 *arg,
 	return 0;
 }
 
-static struct iommu_domain *
+struct iommu_domain *
 arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
 			      const struct iommu_user_data *user_data)
 {
@@ -336,8 +336,8 @@ static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu,
 	return 0;
 }
 
-static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
-				      struct iommu_user_data_array *array)
+int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
+			       struct iommu_user_data_array *array)
 {
 	struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
 	struct arm_smmu_device *smmu = vsmmu->smmu;
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 869c90b660c1..e073b64553d5 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -8,7 +8,9 @@
 #include <linux/dma-mapping.h>
 #include <linux/interrupt.h>
 #include <linux/iommu.h>
+#include <linux/iommufd.h>
 #include <linux/iopoll.h>
+#include <uapi/linux/iommufd.h>
 
 #include <acpi/acpixf.h>
 
@@ -26,8 +28,10 @@
 #define  CMDQV_EN			BIT(0)
 
 #define TEGRA241_CMDQV_PARAM		0x0004
+#define  CMDQV_NUM_SID_PER_VM_LOG2	GENMASK(15, 12)
 #define  CMDQV_NUM_VINTF_LOG2		GENMASK(11, 8)
 #define  CMDQV_NUM_VCMDQ_LOG2		GENMASK(7, 4)
+#define  CMDQV_VER			GENMASK(3, 0)
 
 #define TEGRA241_CMDQV_STATUS		0x0008
 #define  CMDQV_ENABLED			BIT(0)
@@ -53,6 +57,9 @@
 #define  VINTF_STATUS			GENMASK(3, 1)
 #define  VINTF_ENABLED			BIT(0)
 
+#define TEGRA241_VINTF_SID_MATCH(s)	(0x0040 + 0x4*(s))
+#define TEGRA241_VINTF_SID_REPLACE(s)	(0x0080 + 0x4*(s))
+
 #define TEGRA241_VINTF_LVCMDQ_ERR_MAP_64(m) \
 					(0x00C0 + 0x8*(m))
 #define  LVCMDQ_ERR_MAP_NUM_64		2
@@ -114,16 +121,20 @@ MODULE_PARM_DESC(bypass_vcmdq,
 
 /**
  * struct tegra241_vcmdq - Virtual Command Queue
+ * @core: Embedded iommufd_hw_queue structure
  * @idx: Global index in the CMDQV
  * @lidx: Local index in the VINTF
  * @enabled: Enable status
  * @cmdqv: Parent CMDQV pointer
  * @vintf: Parent VINTF pointer
+ * @prev: Previous LVCMDQ to depend on
  * @cmdq: Command Queue struct
  * @page0: MMIO Page0 base address
  * @page1: MMIO Page1 base address
  */
 struct tegra241_vcmdq {
+	struct iommufd_hw_queue core;
+
 	u16 idx;
 	u16 lidx;
 
@@ -131,22 +142,30 @@ struct tegra241_vcmdq {
 
 	struct tegra241_cmdqv *cmdqv;
 	struct tegra241_vintf *vintf;
+	struct tegra241_vcmdq *prev;
 	struct arm_smmu_cmdq cmdq;
 
 	void __iomem *page0;
 	void __iomem *page1;
 };
+#define hw_queue_to_vcmdq(v) container_of(v, struct tegra241_vcmdq, core)
 
 /**
  * struct tegra241_vintf - Virtual Interface
+ * @vsmmu: Embedded arm_vsmmu structure
  * @idx: Global index in the CMDQV
  * @enabled: Enable status
  * @hyp_own: Owned by hypervisor (in-kernel)
  * @cmdqv: Parent CMDQV pointer
  * @lvcmdqs: List of logical VCMDQ pointers
+ * @lvcmdq_mutex: Lock to serialize user-allocated lvcmdqs
  * @base: MMIO base address
+ * @mmap_offset: Offset argument for mmap() syscall
+ * @sids: Stream ID replacement resources
  */
 struct tegra241_vintf {
+	struct arm_vsmmu vsmmu;
+
 	u16 idx;
 
 	bool enabled;
@@ -154,19 +173,41 @@ struct tegra241_vintf {
 
 	struct tegra241_cmdqv *cmdqv;
 	struct tegra241_vcmdq **lvcmdqs;
+	struct mutex lvcmdq_mutex; /* user space race */
 
 	void __iomem *base;
+	unsigned long mmap_offset;
+
+	struct ida sids;
+};
+#define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, vsmmu.core)
+
+/**
+ * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
+ * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
+ * @vintf: Parent VINTF pointer
+ * @sid: Physical Stream ID
+ * @idx: Replacement index in the VINTF
+ */
+struct tegra241_vintf_sid {
+	struct iommufd_vdevice core;
+	struct tegra241_vintf *vintf;
+	u32 sid;
+	u8 idx;
 };
+#define vdev_to_vsid(v) container_of(v, struct tegra241_vintf_sid, core)
 
 /**
  * struct tegra241_cmdqv - CMDQ-V for SMMUv3
  * @smmu: SMMUv3 device
  * @dev: CMDQV device
  * @base: MMIO base address
+ * @base_phys: MMIO physical base address, for mmap
  * @irq: IRQ number
  * @num_vintfs: Total number of VINTFs
  * @num_vcmdqs: Total number of VCMDQs
  * @num_lvcmdqs_per_vintf: Number of logical VCMDQs per VINTF
+ * @num_sids_per_vintf: Total number of SID replacements per VINTF
  * @vintf_ids: VINTF id allocator
  * @vintfs: List of VINTFs
  */
@@ -175,12 +216,14 @@ struct tegra241_cmdqv {
 	struct device *dev;
 
 	void __iomem *base;
+	phys_addr_t base_phys;
 	int irq;
 
 	/* CMDQV Hardware Params */
 	u16 num_vintfs;
 	u16 num_vcmdqs;
 	u16 num_lvcmdqs_per_vintf;
+	u16 num_sids_per_vintf;
 
 	struct ida vintf_ids;
 
@@ -351,6 +394,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
 
 /* HW Reset Functions */
 
+/*
+ * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC
+ * following a timed-out ATC_INV command at the end of the guest queue, the
+ * TIMEOUT will not be reported until this VCMDQ gets assigned to the next VM,
+ * which would be a false alarm potentially causing unwanted behavior in the new
+ * VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when it gets disabled,
+ * which can be done by simply issuing a CMD_SYNC to the main SMMU CMDQ.
+ */
+static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
+{
+	struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
+	u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
+
+	cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
+		      FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
+
+	/*
+	 * It does not hurt to insert another CMD_SYNC, taking advantage of the
+	 * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion.
+	 */
+	arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
+}
+
 /* This function is for LVCMDQ, so @vcmdq must not be unmapped yet */
 static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
 {
@@ -364,6 +430,8 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
 			readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR)),
 			readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, CONS)));
 	}
+	tegra241_vcmdq_hw_flush_timeout(vcmdq);
+
 	writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, PROD));
 	writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, CONS));
 	writeq_relaxed(0, REG_VCMDQ_PAGE1(vcmdq, BASE));
@@ -380,6 +448,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
 	dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
 }
 
+/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
+static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
+{
+	writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
+}
+
 /* This function is for LVCMDQ, so @vcmdq must be mapped prior */
 static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
 {
@@ -390,7 +464,7 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
 	tegra241_vcmdq_hw_deinit(vcmdq);
 
 	/* Configure and enable VCMDQ */
-	writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
+	_tegra241_vcmdq_hw_init(vcmdq);
 
 	ret = vcmdq_write_config(vcmdq, VCMDQ_EN);
 	if (ret) {
@@ -420,6 +494,7 @@ static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq)
 static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
 {
 	u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf;
+	int sidx;
 
 	/* HW requires to unmap LVCMDQs in descending order */
 	while (lidx--) {
@@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
 		}
 	}
 	vintf_write_config(vintf, 0);
+	for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
+		writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
+		writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
+	}
 }
 
 /* Map a global VCMDQ to the pre-assigned LVCMDQ */
@@ -457,7 +536,8 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own)
 	 * whether enabling it here or not, as !HYP_OWN cmdq HWs only support a
 	 * restricted set of supported commands.
 	 */
-	regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own);
+	regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own) |
+		 FIELD_PREP(VINTF_VMID, vintf->vsmmu.vmid);
 	writel(regval, REG_VINTF(vintf, CONFIG));
 
 	ret = vintf_write_config(vintf, regval | VINTF_EN);
@@ -584,7 +664,9 @@ static void tegra241_vintf_free_lvcmdq(struct tegra241_vintf *vintf, u16 lidx)
 
 	dev_dbg(vintf->cmdqv->dev,
 		"%sdeallocated\n", lvcmdq_error_header(vcmdq, header, 64));
-	kfree(vcmdq);
+	/* Guest-owned VCMDQ is freed with the hw_queue by the iommufd core */
+	if (vcmdq->vintf->hyp_own)
+		kfree(vcmdq);
 }
 
 static struct tegra241_vcmdq *
@@ -671,7 +753,13 @@ static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx)
 
 	dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx);
 	tegra241_cmdqv_deinit_vintf(cmdqv, idx);
-	kfree(vintf);
+	if (!vintf->hyp_own) {
+		mutex_destroy(&vintf->lvcmdq_mutex);
+		ida_destroy(&vintf->sids);
+		/* Guest-owned VINTF is freed with the viommu by the iommufd core */
+	} else {
+		kfree(vintf);
+	}
 }
 
 static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu)
@@ -699,10 +787,45 @@ static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu)
 	put_device(cmdqv->dev); /* smmu->impl_dev */
 }
 
+static int
+tegra241_cmdqv_init_vintf_user(struct arm_vsmmu *vsmmu,
+			       const struct iommu_user_data *user_data);
+
+static void *tegra241_cmdqv_hw_info(struct arm_smmu_device *smmu, u32 *length,
+				    enum iommu_hw_info_type *type)
+{
+	struct tegra241_cmdqv *cmdqv =
+		container_of(smmu, struct tegra241_cmdqv, smmu);
+	struct iommu_hw_info_tegra241_cmdqv *info;
+	u32 regval;
+
+	if (*type != IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return ERR_PTR(-ENOMEM);
+
+	regval = readl_relaxed(REG_CMDQV(cmdqv, PARAM));
+	info->log2vcmdqs = ilog2(cmdqv->num_lvcmdqs_per_vintf);
+	info->log2vsids = ilog2(cmdqv->num_sids_per_vintf);
+	info->version = FIELD_GET(CMDQV_VER, regval);
+
+	*length = sizeof(*info);
+	*type = IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV;
+	return info;
+}
+
 static struct arm_smmu_impl_ops tegra241_cmdqv_impl_ops = {
+	/* For in-kernel use */
 	.get_secondary_cmdq = tegra241_cmdqv_get_cmdq,
 	.device_reset = tegra241_cmdqv_hw_reset,
 	.device_remove = tegra241_cmdqv_remove,
+	/* For user-space use */
+	.hw_info = tegra241_cmdqv_hw_info,
+	.vsmmu_size = VIOMMU_STRUCT_SIZE(struct tegra241_vintf, vsmmu.core),
+	.vsmmu_type = IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
+	.vsmmu_init = tegra241_cmdqv_init_vintf_user,
 };
 
 /* Probe Functions */
@@ -844,6 +967,7 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res,
 	cmdqv->irq = irq;
 	cmdqv->base = base;
 	cmdqv->dev = smmu->impl_dev;
+	cmdqv->base_phys = res->start;
 
 	if (cmdqv->irq > 0) {
 		ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr,
@@ -860,6 +984,8 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res,
 	cmdqv->num_vintfs = 1 << FIELD_GET(CMDQV_NUM_VINTF_LOG2, regval);
 	cmdqv->num_vcmdqs = 1 << FIELD_GET(CMDQV_NUM_VCMDQ_LOG2, regval);
 	cmdqv->num_lvcmdqs_per_vintf = cmdqv->num_vcmdqs / cmdqv->num_vintfs;
+	cmdqv->num_sids_per_vintf =
+		1 << FIELD_GET(CMDQV_NUM_SID_PER_VM_LOG2, regval);
 
 	cmdqv->vintfs =
 		kcalloc(cmdqv->num_vintfs, sizeof(*cmdqv->vintfs), GFP_KERNEL);
@@ -913,3 +1039,276 @@ struct arm_smmu_device *tegra241_cmdqv_probe(struct arm_smmu_device *smmu)
 	put_device(smmu->impl_dev);
 	return ERR_PTR(-ENODEV);
 }
+
+/* User space VINTF and VCMDQ Functions */
+
+static size_t tegra241_vintf_get_vcmdq_size(struct iommufd_viommu *viommu,
+					    enum iommu_hw_queue_type queue_type)
+{
+	if (queue_type != IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV)
+		return 0;
+	return HW_QUEUE_STRUCT_SIZE(struct tegra241_vcmdq, core);
+}
+
+static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq)
+{
+	char header[64];
+
+	/* Configure the VCMDQ only; user space does the enabling */
+	_tegra241_vcmdq_hw_init(vcmdq);
+
+	dev_dbg(vcmdq->cmdqv->dev, "%sinited at host PA 0x%llx size 0x%lx\n",
+		lvcmdq_error_header(vcmdq, header, 64),
+		vcmdq->cmdq.q.q_base & VCMDQ_ADDR,
+		1UL << (vcmdq->cmdq.q.q_base & VCMDQ_LOG2SIZE));
+	return 0;
+}
+
+static void
+tegra241_vintf_destroy_lvcmdq_user(struct iommufd_hw_queue *hw_queue)
+{
+	struct tegra241_vcmdq *vcmdq = hw_queue_to_vcmdq(hw_queue);
+
+	mutex_lock(&vcmdq->vintf->lvcmdq_mutex);
+	tegra241_vcmdq_hw_deinit(vcmdq);
+	tegra241_vcmdq_unmap_lvcmdq(vcmdq);
+	tegra241_vintf_free_lvcmdq(vcmdq->vintf, vcmdq->lidx);
+	if (vcmdq->prev)
+		iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core);
+	mutex_unlock(&vcmdq->vintf->lvcmdq_mutex);
+}
+
+static int tegra241_vintf_alloc_lvcmdq_user(struct iommufd_hw_queue *hw_queue,
+					    u32 lidx, phys_addr_t base_addr_pa)
+{
+	struct tegra241_vintf *vintf = viommu_to_vintf(hw_queue->viommu);
+	struct tegra241_vcmdq *vcmdq = hw_queue_to_vcmdq(hw_queue);
+	struct tegra241_cmdqv *cmdqv = vintf->cmdqv;
+	struct arm_smmu_device *smmu = &cmdqv->smmu;
+	struct tegra241_vcmdq *prev = NULL;
+	u32 log2size, max_n_shift;
+	char header[64];
+	int ret;
+
+	if (hw_queue->type != IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV)
+		return -EOPNOTSUPP;
+	if (lidx >= cmdqv->num_lvcmdqs_per_vintf)
+		return -EINVAL;
+
+	mutex_lock(&vintf->lvcmdq_mutex);
+
+	if (vintf->lvcmdqs[lidx]) {
+		ret = -EEXIST;
+		goto unlock;
+	}
+
+	/*
+	 * HW requires LVCMDQs to be mapped in ascending order, so reject if
+	 * the previous LVCMDQ is not allocated yet.
+	 */
+	if (lidx) {
+		prev = vintf->lvcmdqs[lidx - 1];
+		if (!prev) {
+			ret = -EIO;
+			goto unlock;
+		}
+	}
+
+	/*
+	 * hw_queue->length must be a power of 2, in range of
+	 *   [ 32, 2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ]
+	 */
+	max_n_shift = FIELD_GET(IDR1_CMDQS,
+				readl_relaxed(smmu->base + ARM_SMMU_IDR1));
+	if (!is_power_of_2(hw_queue->length) || hw_queue->length < 32 ||
+	    hw_queue->length > (1 << (max_n_shift + CMDQ_ENT_SZ_SHIFT))) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	log2size = ilog2(hw_queue->length) - CMDQ_ENT_SZ_SHIFT;
+
+	/* base_addr_pa must be aligned to hw_queue->length */
+	if (base_addr_pa & ~VCMDQ_ADDR ||
+	    base_addr_pa & (hw_queue->length - 1)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * HW requires LVCMDQs to be unmapped in descending order, so destroy()
+	 * must follow this rule. Set a dependency on the previous LVCMDQ so the
+	 * iommufd core will help enforce it.
+	 */
+	if (prev) {
+		ret = iommufd_hw_queue_depend(vcmdq, prev, core);
+		if (ret)
+			goto unlock;
+	}
+	vcmdq->prev = prev;
+
+	ret = tegra241_vintf_init_lvcmdq(vintf, lidx, vcmdq);
+	if (ret)
+		goto undepend_vcmdq;
+
+	dev_dbg(cmdqv->dev, "%sallocated\n",
+		lvcmdq_error_header(vcmdq, header, 64));
+
+	tegra241_vcmdq_map_lvcmdq(vcmdq);
+
+	vcmdq->cmdq.q.q_base = base_addr_pa & VCMDQ_ADDR;
+	vcmdq->cmdq.q.q_base |= log2size;
+
+	ret = tegra241_vcmdq_hw_init_user(vcmdq);
+	if (ret)
+		goto unmap_lvcmdq;
+
+	hw_queue->destroy = &tegra241_vintf_destroy_lvcmdq_user;
+	mutex_unlock(&vintf->lvcmdq_mutex);
+	return 0;
+
+unmap_lvcmdq:
+	tegra241_vcmdq_unmap_lvcmdq(vcmdq);
+	tegra241_vintf_deinit_lvcmdq(vintf, lidx);
+undepend_vcmdq:
+	if (vcmdq->prev)
+		iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core);
+unlock:
+	mutex_unlock(&vintf->lvcmdq_mutex);
+	return ret;
+}
+
+static void tegra241_cmdqv_destroy_vintf_user(struct iommufd_viommu *viommu)
+{
+	struct tegra241_vintf *vintf = viommu_to_vintf(viommu);
+
+	if (vintf->mmap_offset)
+		iommufd_viommu_destroy_mmap(&vintf->vsmmu.core,
+					    vintf->mmap_offset);
+	tegra241_cmdqv_remove_vintf(vintf->cmdqv, vintf->idx);
+}
+
+static void tegra241_vintf_destroy_vsid(struct iommufd_vdevice *vdev)
+{
+	struct tegra241_vintf_sid *vsid = vdev_to_vsid(vdev);
+	struct tegra241_vintf *vintf = vsid->vintf;
+
+	writel(0, REG_VINTF(vintf, SID_MATCH(vsid->idx)));
+	writel(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx)));
+	ida_free(&vintf->sids, vsid->idx);
+	dev_dbg(vintf->cmdqv->dev,
+		"VINTF%u: deallocated SID_REPLACE%d for pSID=%x\n", vintf->idx,
+		vsid->idx, vsid->sid);
+}
+
+static int tegra241_vintf_init_vsid(struct iommufd_vdevice *vdev)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(vdev->dev);
+	struct tegra241_vintf *vintf = viommu_to_vintf(vdev->viommu);
+	struct tegra241_vintf_sid *vsid = vdev_to_vsid(vdev);
+	struct arm_smmu_stream *stream = &master->streams[0];
+	u64 virt_sid = vdev->virt_id;
+	int sidx;
+
+	if (virt_sid > UINT_MAX)
+		return -EINVAL;
+
+	WARN_ON_ONCE(master->num_streams != 1);
+
+	/* Find an empty pair of SID_REPLACE and SID_MATCH */
+	sidx = ida_alloc_max(&vintf->sids, vintf->cmdqv->num_sids_per_vintf - 1,
+			     GFP_KERNEL);
+	if (sidx < 0)
+		return sidx;
+
+	writel(stream->id, REG_VINTF(vintf, SID_REPLACE(sidx)));
+	writel(virt_sid << 1 | 0x1, REG_VINTF(vintf, SID_MATCH(sidx)));
+	dev_dbg(vintf->cmdqv->dev,
+		"VINTF%u: allocated SID_REPLACE%d for pSID=%x, vSID=%x\n",
+		vintf->idx, sidx, stream->id, (u32)virt_sid);
+
+	vsid->idx = sidx;
+	vsid->vintf = vintf;
+	vsid->sid = stream->id;
+
+	vdev->destroy = &tegra241_vintf_destroy_vsid;
+	return 0;
+}
+
+static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
+	.destroy = tegra241_cmdqv_destroy_vintf_user,
+	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
+	.cache_invalidate = arm_vsmmu_cache_invalidate,
+	.vdevice_size = VDEVICE_STRUCT_SIZE(struct tegra241_vintf_sid, core),
+	.vdevice_init = tegra241_vintf_init_vsid,
+	.get_hw_queue_size = tegra241_vintf_get_vcmdq_size,
+	.hw_queue_init_phys = tegra241_vintf_alloc_lvcmdq_user,
+};
+
+static int
+tegra241_cmdqv_init_vintf_user(struct arm_vsmmu *vsmmu,
+			       const struct iommu_user_data *user_data)
+{
+	struct tegra241_cmdqv *cmdqv =
+		container_of(vsmmu->smmu, struct tegra241_cmdqv, smmu);
+	struct tegra241_vintf *vintf = viommu_to_vintf(&vsmmu->core);
+	struct iommu_viommu_tegra241_cmdqv data;
+	phys_addr_t page0_base;
+	int ret;
+
+	if (!user_data)
+		return -EINVAL;
+
+	ret = iommu_copy_struct_from_user(&data, user_data,
+					  IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
+					  out_vintf_mmap_length);
+	if (ret)
+		return ret;
+
+	ret = tegra241_cmdqv_init_vintf(cmdqv, cmdqv->num_vintfs - 1, vintf);
+	if (ret < 0) {
+		dev_err(cmdqv->dev, "no more available vintf\n");
+		return ret;
+	}
+
+	/*
+	 * Initialize the user-owned VINTF without an LVCMDQ, as it cannot pre-
+	 * allocate an LVCMDQ until user space wants one, for security reasons.
+	 * This is different from the kernel-owned VINTF0, which has pre-assigned
+	 * and pre-allocated global VCMDQs that are mapped to its LVCMDQs by the
+	 * tegra241_vintf_hw_init() call.
+	 */
+	ret = tegra241_vintf_hw_init(vintf, false);
+	if (ret)
+		goto deinit_vintf;
+
+	page0_base = cmdqv->base_phys + TEGRA241_VINTFi_PAGE0(vintf->idx);
+	ret = iommufd_viommu_alloc_mmap(&vintf->vsmmu.core, page0_base, SZ_64K,
+					&vintf->mmap_offset);
+	if (ret)
+		goto hw_deinit_vintf;
+
+	data.out_vintf_mmap_length = SZ_64K;
+	data.out_vintf_mmap_offset = vintf->mmap_offset;
+	ret = iommu_copy_struct_to_user(user_data, &data,
+					IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
+					out_vintf_mmap_length);
+	if (ret)
+		goto free_mmap;
+
+	ida_init(&vintf->sids);
+	mutex_init(&vintf->lvcmdq_mutex);
+
+	dev_dbg(cmdqv->dev, "VINTF%u: allocated with vmid (%d)\n", vintf->idx,
+		vintf->vsmmu.vmid);
+
+	vsmmu->core.ops = &tegra241_cmdqv_viommu_ops;
+	return 0;
+
+free_mmap:
+	iommufd_viommu_destroy_mmap(&vintf->vsmmu.core, vintf->mmap_offset);
+hw_deinit_vintf:
+	tegra241_vintf_hw_deinit(vintf);
+deinit_vintf:
+	tegra241_cmdqv_deinit_vintf(cmdqv, vintf->idx);
+	return ret;
+}
-- 
2.43.0
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months, 1 week ago
On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
[...]

Reviewed-by: Pranjal Shrivastava <praan@google.com> with the following
nits:

[...]
>  /**
>   * enum iommu_hw_info_type - IOMMU Hardware Info Types
>   * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
> @@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
>   * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
>   * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
>   * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
> + * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> + *                                     SMMUv3) info type

I know that the goal here is to mention that Tegra241 CMDQV is an
extension for Arm SMMUv3, but this comment could be misunderstood as the
"type" being an extension to IOMMU_HW_INFO_TYPE_ARM_SMMUV3. How about we
get rid of the word "extension" and have something like:

 * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV info type (NVIDIA's 
 * 				       implementation of Arm SMMUv3 with CMDQ 
 *				       Virtualization)

Sorry to be nit-picky here; I know the code is clear, but I've seen that
people don't read more than the uapi descriptions. Maybe we could re-write
this comment, here and everywhere else?

[...]
> +/**
> + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> + * @vintf: Parent VINTF pointer
> + * @sid: Physical Stream ID
> + * @idx: Replacement index in the VINTF
> + */
> +struct tegra241_vintf_sid {
> +	struct iommufd_vdevice core;
> +	struct tegra241_vintf *vintf;
> +	u32 sid;
> +	u8 idx;
>  };

AFAIU, this seems to be a handle for the sid -> vintf mapping.. if yes, then
I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?

[...]
> +/*
> + * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC
> + * following a timed-out ATC_INV command at the end of the guest queue, the
> + * TIMEOUT will not be reported until this VCMDQ gets assigned to the next VM,
> + * which would be a false alarm potentially causing unwanted behavior in the new
> + * VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when it gets disabled,
> + * which can be done by simply issuing a CMD_SYNC to the main SMMU CMDQ.
> + */
> +static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
> +{
> +	struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
> +	u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
> +
> +	cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
> +		      FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
> +
> +	/*
> +	 * It does not hurt to insert another CMD_SYNC, taking advantage of the
> +	 * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion.
> +	 */
> +	arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
> +}

If I'm getting this right, it issues a CMD_SYNC to the Host's CMDQ i.e.
the non-CMDQV CMDQ, the main CMDQ of the SMMUv3? (i.e. the CMDQ present
without the Tegra241 CMDQV extension?)

So, basically on every VM switch, there would be an additional CMD_SYNC
issued to the non-CMDQV CMDQ to flush the TIMEOUT, and we'll poll for
its completion?

[...]
> +/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
> +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> +{
> +	writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
> +}
> +

Not sure why we broke this off into a function; will there be more stuff
here, or is this just to reuse it in tegra241_vcmdq_hw_init_user as well?

[...]
> @@ -420,6 +494,7 @@ static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq)
>  static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
>  {
>  	u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf;
> +	int sidx;
>  
>  	/* HW requires to unmap LVCMDQs in descending order */
>  	while (lidx--) {
> @@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
>  		}
>  	}
>  	vintf_write_config(vintf, 0);
> +	for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
> +		writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
> +		writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
> +	}
>  }

I'm assuming we call the de-init while switching VMs and hence we need
to clear these to avoid spurious SID replacements in the new VM? Or do
they not reset to 0 when the HW is reset?

[...]
> +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> +	.destroy = tegra241_cmdqv_destroy_vintf_user,
> +	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> +	.cache_invalidate = arm_vsmmu_cache_invalidate,

I see that we currently use the main cmdq to issue these cache
invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
hoping for this series to change that but I'm assuming there's another
series coming for that?

Meanwhile, I guess it'd be good to call that out for folks who have
Grace and start trying out this feature.. I'm assuming they won't see
as much perf improvement with this series alone since we're still using
the main CMDQ in the upstream code?

> +	.vdevice_size = VDEVICE_STRUCT_SIZE(struct tegra241_vintf_sid, core),
> +	.vdevice_init = tegra241_vintf_init_vsid,
> +	.get_hw_queue_size = tegra241_vintf_get_vcmdq_size,
> +	.hw_queue_init_phys = tegra241_vintf_alloc_lvcmdq_user,
> +};
> +
> +static int
> +tegra241_cmdqv_init_vintf_user(struct arm_vsmmu *vsmmu,
> +			       const struct iommu_user_data *user_data)
> +{
> +	struct tegra241_cmdqv *cmdqv =
> +		container_of(vsmmu->smmu, struct tegra241_cmdqv, smmu);
> +	struct tegra241_vintf *vintf = viommu_to_vintf(&vsmmu->core);
> +	struct iommu_viommu_tegra241_cmdqv data;
> +	phys_addr_t page0_base;
> +	int ret;
> +
> +	if (!user_data)
> +		return -EINVAL;
> +
> +	ret = iommu_copy_struct_from_user(&data, user_data,
> +					  IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> +					  out_vintf_mmap_length);
> +	if (ret)
> +		return ret;
> +
> +	ret = tegra241_cmdqv_init_vintf(cmdqv, cmdqv->num_vintfs - 1, vintf);
> +	if (ret < 0) {
> +		dev_err(cmdqv->dev, "no more available vintf\n");
> +		return ret;
> +	}
> +
> +	/*
> +	 * Initialize the user-owned VINTF without an LVCMDQ, as it cannot
> +	 * pre-allocate an LVCMDQ until user space asks for one, for security
> +	 * reasons. This differs from the kernel-owned VINTF0, which has pre-
> +	 * assigned and pre-allocated global VCMDQs that are mapped to LVCMDQs
> +	 * by the tegra241_vintf_hw_init() call.
> +	 */
> +	ret = tegra241_vintf_hw_init(vintf, false);
> +	if (ret)
> +		goto deinit_vintf;
> +
> +	page0_base = cmdqv->base_phys + TEGRA241_VINTFi_PAGE0(vintf->idx);
> +	ret = iommufd_viommu_alloc_mmap(&vintf->vsmmu.core, page0_base, SZ_64K,
> +					&vintf->mmap_offset);
> +	if (ret)
> +		goto hw_deinit_vintf;
> +
> +	data.out_vintf_mmap_length = SZ_64K;
> +	data.out_vintf_mmap_offset = vintf->mmap_offset;
> +	ret = iommu_copy_struct_to_user(user_data, &data,
> +					IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> +					out_vintf_mmap_length);
> +	if (ret)
> +		goto free_mmap;
> +
> +	ida_init(&vintf->sids);
> +	mutex_init(&vintf->lvcmdq_mutex);
> +
> +	dev_dbg(cmdqv->dev, "VINTF%u: allocated with vmid (%d)\n", vintf->idx,
> +		vintf->vsmmu.vmid);
> +
> +	vsmmu->core.ops = &tegra241_cmdqv_viommu_ops;
> +	return 0;
> +
> +free_mmap:
> +	iommufd_viommu_destroy_mmap(&vintf->vsmmu.core, vintf->mmap_offset);
> +hw_deinit_vintf:
> +	tegra241_vintf_hw_deinit(vintf);
> +deinit_vintf:
> +	tegra241_cmdqv_deinit_vintf(cmdqv, vintf->idx);
> +	return ret;
> +}

Thanks,
Praan

> -- 
> 2.43.0
>
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Nicolin Chen 3 months, 1 week ago
On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> >  /**
> >   * enum iommu_hw_info_type - IOMMU Hardware Info Types
> >   * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
> > @@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
> >   * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
> >   * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> >   * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
> > + * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> > + *                                     SMMUv3) info type
> 
> I know that the goal here is to mention that Tegra241 CMDQV is an
> extension for Arm SMMUv3, but this comment could be misunderstood as the
> "type" being an extension to IOMMU_HW_INFO_TYPE_ARM_SMMUV3. How about we

IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV only reports the CMDQV structure.
The VMM still needs to poll IOMMU_HW_INFO_TYPE_ARM_SMMUV3 as well, so
it is basically working as a "type being an extension".
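
For instance, the VMM would query both types back to back. A rough
sketch (error handling omitted; the in_data_type input selector is the
field added by this series' uapi, so treat that name as an assumption):

	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
		.in_data_type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3,
		.data_len = sizeof(smmuv3),	/* iommu_hw_info_arm_smmuv3 */
		.data_uptr = (uintptr_t)&smmuv3,
	};

	ioctl(iommufd, IOMMU_GET_HW_INFO, &info);	/* base SMMUv3 info */

	info.in_data_type = IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV;
	info.data_len = sizeof(cmdqv);			/* CMDQV info struct */
	info.data_uptr = (uintptr_t)&cmdqv;
	ioctl(iommufd, IOMMU_GET_HW_INFO, &info);	/* CMDQV extension info */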

> Sorry to be nit-picky here, I know that the code is clear, but I've seen
> people don't care to read more than the uapi descriptions. Maybe we
> could re-write this comment, here and everywhere else?

I can change this to:

+ * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ *                                    SMMUv3) enabled ARM SMMUv3 type

> > +/**
> > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > + * @vintf: Parent VINTF pointer
> > + * @sid: Physical Stream ID
> > + * @idx: Replacement index in the VINTF
> > + */
> > +struct tegra241_vintf_sid {
> > +	struct iommufd_vdevice core;
> > +	struct tegra241_vintf *vintf;
> > +	u32 sid;
> > +	u8 idx;
> >  };
> 
AFAIU, this seems to be a handle for sid -> vintf mapping.. if yes, then
> I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?

No. It's for vSID to pSID mappings. I had it explained in commit log:

For ATC invalidation commands that hold an SID, it requires all devices to
register their virtual SIDs to the SID_MATCH registers and their physical
SIDs to the pairing SID_REPLACE registers, so that HW can use those as a
lookup table to replace those virtual SIDs with the correct physical SIDs.
Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid
structure to allocate SID_REPLACE and to program the SIDs accordingly.
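
Concretely, with the tegra241_vintf_init_vsid() programming shown
earlier, an illustrative vSID 0x8 backed by pSID 0x21 at replacement
index 0 would end up as (example values only):

	SID_MATCH(0)   = (0x8 << 1) | 0x1 = 0x11	/* valid bit + vSID to match */
	SID_REPLACE(0) = 0x21				/* pSID that HW substitutes */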

> > @@ -351,6 +394,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
> >  
> >  /* HW Reset Functions */
> >  
> > +/*
> > + * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC
> > + * following an ATC_INV command at the end of the guest queue while this ATC_INV
> > + * is timed out, the TIMEOUT will not be reported until this VCMDQ gets assigned
> > + * to the next VM, which will be a false alarm potentially causing some unwanted
> > + * behavior in the new VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when
> > + * it gets disabled. This can be done by just issuing a CMD_SYNC to SMMU CMDQ.
> > + */
> > +static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
> > +{
> > +	struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
> > +	u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
> > +
> > +	cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
> > +		      FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
> > +
> > +	/*
> > +	 * It does not hurt to insert another CMD_SYNC, taking advantage of the
> > +	 * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion.
> > +	 */
> > +	arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
> > +}
> 
> If I'm getting this right, it issues a CMD_SYNC to the Host's CMDQ i.e.
> the non-CMDQV CMDQ, the main CMDQ of the SMMUv3? (i.e. the CMDQ present
> without the Tegra241 CMDQV extension?)
>
> so.. basically on every VM switch, there would be an additional CMD_SYNC
> issued to the non-CMDQV CMDQ to flush the TIMEOUT and we'll poll for
> its completion?

The main CMDQ exists regardless of whether the CMDQV extension is
there or not. The CMD_SYNC can be issued to any (v)CMDQ. The
smmu->cmdq is just the easiest one to use here.

> > @@ -380,6 +448,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
> >  	dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
> >  }
> >  
> > +/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
> > +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> > +{
> > +	writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
> > +}
> > +
> 
> Not sure why we broke this off to a function, will there be more stuff
> here or is this just to use it in tegra241_vcmdq_hw_init_user as well?

I can take it off.

> > @@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
> >  		}
> >  	}
> >  	vintf_write_config(vintf, 0);
> > +	for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
> > +		writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
> > +		writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
> > +	}
> >  }
> 
> I'm assuming we call the de-init while switching VMs and hence we need
> to clear these to avoid spurious SID replacements in the new VM? Or do
> they not reset to 0 when the HW is reset?

The driver does not reset HW when tearing down a VM, but only sets
VINTF's enable bit to 0. So, it should just set other configuration
bits to 0 as well.

> > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > +	.destroy = tegra241_cmdqv_destroy_vintf_user,
> > +	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > +	.cache_invalidate = arm_vsmmu_cache_invalidate,
> 
> I see that we currently use the main cmdq to issue these cache
> invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> hoping for this series to change that but I'm assuming there's another
> series coming for that?
> 
> Meanwhile, I guess it'd be good to call that out for folks who have
> Grace and start trying out this feature.. I'm assuming they won't see
> as much perf improvement with this series alone since we're still using
> the main CMDQ in the upstream code?

VCMDQ only accelerates invalidation commands.

That is for non-invalidation commands that VCMDQ doesn't support,
so they still have to go in the standard nesting pathway.

Let's add a line:
	/* for non-invalidation commands use */

Nicolin
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months, 1 week ago
On Tue, Jul 01, 2025 at 12:42:32PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> > On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> > >  /**
> > >   * enum iommu_hw_info_type - IOMMU Hardware Info Types
> > >   * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
> > > @@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
> > >   * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
> > >   * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> > >   * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
> > > + * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> > > + *                                     SMMUv3) info type
> > 
> > I know that the goal here is to mention that Tegra241 CMDQV is an
> > extension for Arm SMMUv3, but this comment could be misunderstood as the
> > "type" being an extension to IOMMU_HW_INFO_TYPE_ARM_SMMUV3. How about we
> 
> IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV only reports the CMDQV structure.
> The VMM still needs to poll IOMMU_HW_INFO_TYPE_ARM_SMMUV3 as well, so
> it is basically working as a "type being an extension".
> 

Ohh okay, I see.. I thought we were describing the HW.

> > Sorry to be nit-picky here, I know that the code is clear, but I've seen
> > people don't care to read more than the uapi descriptions. Maybe we
> > could re-write this comment, here and everywhere else?
> 
> I can change this to:
> 
> + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> + *                                    SMMUv3) enabled ARM SMMUv3 type
> 

Yes, that helps, thanks!

> > > +/**
> > > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > > + * @vintf: Parent VINTF pointer
> > > + * @sid: Physical Stream ID
> > > + * @idx: Replacement index in the VINTF
> > > + */
> > > +struct tegra241_vintf_sid {
> > > +	struct iommufd_vdevice core;
> > > +	struct tegra241_vintf *vintf;
> > > +	u32 sid;
> > > +	u8 idx;
> > >  };
> > 
> > AFAIU, this seems to be a handle for sid -> vintf mapping.. if yes, then
> > I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?
> 
> No. It's for vSID to pSID mappings. I had it explained in commit log:
> 

I get that, it's for vSID -> pSID mapping which also "happens to" point
to the vintf.. all I wanted to say was that the description is unclear..
We could've described it as "Vintf SID map" or something, but I guess
it's fine the way it is too.. your call.

> For ATC invalidation commands that hold an SID, it requires all devices to
> register their virtual SIDs to the SID_MATCH registers and their physical
> SIDs to the pairing SID_REPLACE registers, so that HW can use those as a
> lookup table to replace those virtual SIDs with the correct physical SIDs.
> Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid
> structure to allocate SID_REPLACE and to program the SIDs accordingly.
> 
> > > @@ -351,6 +394,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
> > >  
> > >  /* HW Reset Functions */
> > >  
> > > +/*
> > > + * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC
> > > + * following an ATC_INV command at the end of the guest queue while this ATC_INV
> > > + * is timed out, the TIMEOUT will not be reported until this VCMDQ gets assigned
> > > + * to the next VM, which will be a false alarm potentially causing some unwanted
> > > + * behavior in the new VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when
> > > + * it gets disabled. This can be done by just issuing a CMD_SYNC to SMMU CMDQ.
> > > + */
> > > +static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
> > > +{
> > > +	struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
> > > +	u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
> > > +
> > > +	cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
> > > +		      FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
> > > +
> > > +	/*
> > > +	 * It does not hurt to insert another CMD_SYNC, taking advantage of the
> > > +	 * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion.
> > > +	 */
> > > +	arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
> > > +}
> > 
> > If I'm getting this right, it issues a CMD_SYNC to the Host's CMDQ i.e.
> > the non-CMDQV CMDQ, the main CMDQ of the SMMUv3? (i.e. the CMDQ present
> > without the Tegra241 CMDQV extension?)
> >
> > so.. basically on every VM switch, there would be an additional CMD_SYNC
> > issued to the non-CMDQV CMDQ to flush the TIMEOUT and we'll poll for
> > its completion?
> 
> The main CMDQ exists regardless of whether the CMDQV extension is
> there or not. The CMD_SYNC can be issued to any (v)CMDQ. The
> smmu->cmdq is just the easiest one to use here.
> 

I see. Thanks!

> > > @@ -380,6 +448,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
> > >  	dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
> > >  }
> > >  
> > > +/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
> > > +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> > > +{
> > > +	writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
> > > +}
> > > +
> > 
> > Not sure why we broke this off to a function, will there be more stuff
> > here or is this just to use it in tegra241_vcmdq_hw_init_user as well?
> 
> I can take it off.
> 

Nah, that's okay, I was just curious.

> > > @@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
> > >  		}
> > >  	}
> > >  	vintf_write_config(vintf, 0);
> > > +	for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
> > > +		writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
> > > +		writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
> > > +	}
> > >  }
> > 
> > I'm assuming we call the de-init while switching VMs and hence we need
> > to clear these to avoid spurious SID replacements in the new VM? Or do
> > they not reset to 0 when the HW is reset?
> 
> The driver does not reset HW when tearing down a VM, but only sets
> VINTF's enable bit to 0. So, it should just set other configuration
> bits to 0 as well.
> 
> > > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > > +	.destroy = tegra241_cmdqv_destroy_vintf_user,
> > > +	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > > +	.cache_invalidate = arm_vsmmu_cache_invalidate,
> > 
> > I see that we currently use the main cmdq to issue these cache
> > invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> > hoping for this series to change that but I'm assuming there's another
> > series coming for that?
> > 
> > Meanwhile, I guess it'd be good to call that out for folks who have
> > Grace and start trying out this feature.. I'm assuming they won't see
> > as much perf improvement with this series alone since we're still using
> > the main CMDQ in the upstream code?
> 
> VCMDQ only accelerates invalidation commands.
> 

I get that.. but I see we're using `arm_vsmmu_cache_invalidate` here
from arm-smmu-v3-iommufd.c which seems to issue all commands to
smmu->cmdq as of now (the code has a FIXME as well), per the code:

	/* FIXME always uses the main cmdq rather than trying to group by type */
        ret = arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, last->cmd,
					  cur - last, true);

I was hoping this FIXME to be addressed in this series..

> That is for non-invalidation commands that VCMDQ doesn't support,
> so they still have to go in the standard nesting pathway.
> 
> Let's add a line:
> 	/* for non-invalidation commands use */

Umm.. I was talking about the cache_invalidate op? I think there's some
misunderstanding here? What am I missing?

> 
> Nicolin

Thanks
Praan
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Nicolin Chen 3 months, 1 week ago
On Tue, Jul 01, 2025 at 08:03:35PM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 12:42:32PM -0700, Nicolin Chen wrote:
> > On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> > > On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> > > > +/**
> > > > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > > > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > > > + * @vintf: Parent VINTF pointer
> > > > + * @sid: Physical Stream ID
> > > > + * @idx: Replacement index in the VINTF
> > > > + */
> > > > +struct tegra241_vintf_sid {
> > > > +	struct iommufd_vdevice core;
> > > > +	struct tegra241_vintf *vintf;
> > > > +	u32 sid;
> > > > +	u8 idx;
> > > >  };
> > > 
> > > AFAIU, this seems to be a handle for sid -> vintf mapping.. if yes, then
> > > I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?
> > 
> > No. It's for vSID to pSID mappings. I had it explained in commit log:
> > 
> 
> I get that, it's for vSID -> pSID mapping which also "happens to" point
> to the vintf.. all I wanted to say was that the description is unclear..
> We could've described it as "Vintf SID map" or something, but I guess
> it's fine the way it is too.. your call.

The "replace" word is borrowed from the "SID_REPLACE" HW register.

But I think it's okay to call it just "mapping", if that makes it
clearer.

> > > > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > > > +	.destroy = tegra241_cmdqv_destroy_vintf_user,
> > > > +	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > > > +	.cache_invalidate = arm_vsmmu_cache_invalidate,
> > > 
> > > I see that we currently use the main cmdq to issue these cache
> > > invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> > > hoping for this series to change that but I'm assuming there's another
> > > series coming for that?
> > > 
> > > Meanwhile, I guess it'd be good to call that out for folks who have
> > > Grace and start trying out this feature.. I'm assuming they won't see
> > > as much perf improvement with this series alone since we're still using
> > > the main CMDQ in the upstream code?
> > 
> > VCMDQ only accelerates invalidation commands.
> > 
> 
> I get that.. but I see we're using `arm_vsmmu_cache_invalidate` here
> from arm-smmu-v3-iommufd.c which seems to issue all commands to
> smmu->cmdq as of now (the code has a FIXME as well), per the code:
> 
> 	/* FIXME always uses the main cmdq rather than trying to group by type */
>         ret = arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, last->cmd,
> 					  cur - last, true);
> 
> I was hoping this FIXME to be addressed in this series..

Oh, that's not related.

The main goal of this series is to route all invalidation commands
to the VCMDQ HW. And this is where Grace users can see perf gains
mentioned in the cover letter or commit log, from eliminating the
VM Exits at those most frequently used commands.

Any non-invalidation commands will just reuse what we have with the
standard SMMU nesting. And even if we did something to that FIXME,
there is no significant perf gain as it's going down the trapping
pathway, i.e. the VM Exits are always there.

> > That is for non-invalidation commands that VCMDQ doesn't support,
> > so they still have to go in the standard nesting pathway.
> > 
> > Let's add a line:
> > 	/* for non-invalidation commands use */
> 
> Umm.. I was talking about the cache_invalidate op? I think there's some
> misunderstanding here? What am I missing?

That line is exactly for cache_invalidate. All the non-invalidation
commands will be sent to the arm_vsmmu_cache_invalidate() by the VMM,
which is what that line means.

Or perhaps calling them "non-accelerated commands" would be nicer.

Thanks
Nicolin
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months, 1 week ago
On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 08:03:35PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 12:42:32PM -0700, Nicolin Chen wrote:
> > > On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> > > > On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> > > > > +/**
> > > > > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > > > > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > > > > + * @vintf: Parent VINTF pointer
> > > > > + * @sid: Physical Stream ID
> > > > > + * @idx: Replacement index in the VINTF
> > > > > + */
> > > > > +struct tegra241_vintf_sid {
> > > > > +	struct iommufd_vdevice core;
> > > > > +	struct tegra241_vintf *vintf;
> > > > > +	u32 sid;
> > > > > +	u8 idx;
> > > > >  };
> > > > 
> > > > AFAIU, this seems to be a handle for sid -> vintf mapping.. if yes, then
> > > > I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?
> > > 
> > > No. It's for vSID to pSID mappings. I had it explained in commit log:
> > > 
> > 
> > I get that, it's for vSID -> pSID mapping which also "happens to" point
> > to the vintf.. all I wanted to say was that the description is unclear..
> > We could've described it as "Vintf SID map" or something, but I guess
> > it's fine the way it is too.. your call.
> 
> The "replace" word is borrowed from the "SID_REPLACE" HW register.
> 
> But I think it's okay to call it just "mapping", if that makes it
> clearer.
> 

Anything works. Maybe let it be as is.

> > > > > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > > > > +	.destroy = tegra241_cmdqv_destroy_vintf_user,
> > > > > +	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > > > > +	.cache_invalidate = arm_vsmmu_cache_invalidate,
> > > > 
> > > > I see that we currently use the main cmdq to issue these cache
> > > > invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> > > > hoping for this series to change that but I'm assuming there's another
> > > > series coming for that?
> > > > 
> > > > Meanwhile, I guess it'd be good to call that out for folks who have
> > > > Grace and start trying out this feature.. I'm assuming they won't see
> > > > as much perf improvement with this series alone since we're still using
> > > > the main CMDQ in the upstream code?
> > > 
> > > VCMDQ only accelerates invalidation commands.
> > > 
> > 
> > I get that.. but I see we're using `arm_vsmmu_cache_invalidate` here
> > from arm-smmu-v3-iommufd.c which seems to issue all commands to
> > smmu->cmdq as of now (the code has a FIXME as well), per the code:
> > 
> > 	/* FIXME always uses the main cmdq rather than trying to group by type */
> >         ret = arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, last->cmd,
> > 					  cur - last, true);
> > 
> > I was hoping this FIXME to be addressed in this series..
> 
> Oh, that's not related.
> 
> The main goal of this series is to route all invalidation commands
> to the VCMDQ HW. And this is where Grace users can see perf gains
> mentioned in the cover letter or commit log, from eliminating the
> VM Exits at those most frequently used commands.
> 
> Any non-invalidation commands will just reuse what we have with the
> standard SMMU nesting. And even if we did something to that FIXME,
> there is no significant perf gain as it's going down the trapping
> pathway, i.e. the VM Exits are always there.
> 
> > > That is for non-invalidation commands that VCMDQ doesn't support,
> > > so they still have to go in the standard nesting pathway.
> > > 
> > > Let's add a line:
> > > 	/* for non-invalidation commands use */
> > 
> > Umm.. I was talking about the cache_invalidate op? I think there's some
> > misunderstanding here? What am I missing?
> 
> That line is exactly for cache_invalidate. All the non-invalidation
> commands will be sent to the arm_vsmmu_cache_invalidate() by the VMM,
> which is what that line means.
> 
> Or perhaps calling them "non-accelerated commands" would be nicer.

Uhh okay, so there'll be a separate driver in the VM issuing invalidation
commands directly to the CMDQV, thus we don't see any of its part here?

AND for non-invalidation commands, we trap out and the VMM ends up
calling the `cache_invalidate` op of the viommu?

Is that understanding correct?

> 
> Thanks
> Nicolin

Thanks
Praan
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Nicolin Chen 3 months, 1 week ago
On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > Or perhaps calling them "non-accelerated commands" would be nicer.
> 
> Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> commands directly to the CMDQV, thus we don't see any of its part here?

That's how it works. VM must run a guest-level VCMDQ driver that
separates accelerated and non-accelerated commands as it already
does:

    accelerated commands => VCMDQ (HW)
non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
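
The guest-side split is just an opcode check, roughly like this sketch
(paraphrasing the supported-command check in tegra241-cmdqv.c; the
authoritative list is in the driver):

	switch (FIELD_GET(CMDQ_0_OP, cmd[0])) {
	case CMDQ_OP_TLBI_NH_ASID:
	case CMDQ_OP_TLBI_NH_VA:
	case CMDQ_OP_ATC_INV:
		return true;	/* accelerated: issue to VCMDQ (HW) */
	default:
		return false;	/* unsupported: issue to SMMU CMDQ, trapped */
	}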

Nicolin
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months, 1 week ago
On Tue, Jul 01, 2025 at 03:07:57PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > > Or perhaps calling them "non-accelerated commands" would be nicer.
> > 
> > Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> > commands directly to the CMDQV, thus we don't see any of its part here?
> 
> That's how it works. VM must run a guest-level VCMDQ driver that
> separates accelerated and non-accelerated commands as it already
> does:
> 
>     accelerated commands => VCMDQ (HW)
> non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
> 

Right exactly what got me confused. I was assuming the same CMDQV driver
would run in the Guest kernel but seems like there's another driver for
the Guest that's not in tree yet or maybe is a purely user-space thing?

And the weird part was that "invalidation" commands are accelerated but
we use the .cache_invalidate viommu op for `non-invalidation` commands.
But I guess what you meant there could be non-accelerated invalidation 
commands (maybe some stage-2 TLBIs?) which would go through the
.cache_invalidate op, right?

> Nicolin

Praan
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Nicolin Chen 3 months, 1 week ago
On Tue, Jul 01, 2025 at 10:51:20PM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 03:07:57PM -0700, Nicolin Chen wrote:
> > On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> > > On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > > > Or perhaps calling them "non-accelerated commands" would be nicer.
> > > 
> > > Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> > > commands directly to the CMDQV, thus we don't see any of its part here?
> > 
> > That's how it works. VM must run a guest-level VCMDQ driver that
> > separates accelerated and non-accelerated commands as it already
> > does:
> > 
> >     accelerated commands => VCMDQ (HW)
> > non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
> > 
> 
> Right exactly what got me confused. I was assuming the same CMDQV driver
> would run in the Guest kernel but seems like there's another driver for
> the Guest that's not in tree yet or maybe is a purely user-space thing?

It's the same tegra241-cmdqv.c in the kernel, which is already
a part of mainline Linux. Both host and guest run the same copy
of software. The host kernel just has the user VINTF part (via
iommufd) additional to what the guest already has.
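
For reference, the VMM consumes the user VINTF part roughly like this
(a sketch; "iommufd" is the iommufd file descriptor and "data" is the
iommu_viommu_tegra241_cmdqv struct filled by the vIOMMU allocation;
error handling omitted):

	/* Map VINTF page0 from the iommufd using the returned offset */
	void *page0 = mmap(NULL, data.out_vintf_mmap_length,
			   PROT_READ | PROT_WRITE, MAP_SHARED,
			   iommufd, data.out_vintf_mmap_offset);

	/* Expose page0 to the guest as its VINTF0 page0; trap page1 */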

> And the weird part was that "invalidation" commands are accelerated but
> we use the .cache_invalidate viommu op for `non-invalidation` commands.
> But I guess what you meant there could be non-accelerated invalidation 
> commands (maybe some stage-2 TLBIs?) which would go through the
> .cache_invalidate op, right?

I am talking about this:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c?h=v6.16-rc4#n305

Those commands returned "false" will be issued to smmu->cmdq in a
guest VM, which will be trapped by VMM as a standard SMMU nesting
and will be further forwarded via iommufd to the host kernel that
will invoke this cache_invalidate op in the arm-smmu-v3 driver.

Those commands returned "true" will be issued to vcmdq->cmdq that
is HW-accelerated queue (setup by VMM via iommufd's hw_queue/mmap).

Nicolin
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months, 1 week ago
On Tue, Jul 01, 2025 at 04:01:34PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 10:51:20PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 03:07:57PM -0700, Nicolin Chen wrote:
> > > On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> > > > On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > > > > Or perhaps calling them "non-accelerated commands" would be nicer.
> > > > 
> > > > Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> > > > commands directly to the CMDQV, thus we don't see any of its part here?
> > > 
> > > That's how it works. VM must run a guest-level VCMDQ driver that
> > > separates accelerated and non-accelerated commands as it already
> > > does:
> > > 
> > >     accelerated commands => VCMDQ (HW)
> > > non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
> > > 
> > 
> > Right exactly what got me confused. I was assuming the same CMDQV driver
> > would run in the Guest kernel but seems like there's another driver for
> > the Guest that's not in tree yet or maybe is a purely user-space thing?
> 
> It's the same tegra241-cmdqv.c in the kernel, which is already
> a part of mainline Linux. Both host and guest run the same copy
> of software. The host kernel just has the user VINTF part (via
> iommufd) additional to what the guest already has.
> 
> > And the weird part was that "invalidation" commands are accelerated but
> > we use the .cache_invalidate viommu op for `non-invalidation` commands.
> > But I guess what you meant there could be non-accelerated invalidation 
> > commands (maybe some stage-2 TLBIs?) which would go through the
> > .cache_invalidate op, right?
> 
> I am talking about this:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c?h=v6.16-rc4#n305
> 
> Those commands returned "false" will be issued to smmu->cmdq in a
> guest VM, which will be trapped by VMM as a standard SMMU nesting
> and will be further forwarded via iommufd to the host kernel that
> will invoke this cache_invalidate op in the arm-smmu-v3 driver.
> 
> Those commands returned "true" will be issued to vcmdq->cmdq that
> is HW-accelerated queue (setup by VMM via iommufd's hw_queue/mmap).


Right, this brings me back to my original understanding: the arm-smmu-v3
driver checks for "supported commands" and figures out which queue they
shall be issued to.. now there are "non-invalidation" commands which are
non-accelerated, like CMD_PRI_RESP, which would be issued through the
trap => .cache_invalidate path.

Thus, coming back to the two initial points:

1) Issuing "non-invalidation" commands through .cache_invalidate could
   be confusing, I'm not asking to change the op name here, but if we
   plan to label it, let's label them as "Trapped commands" OR
   "non-accelerated" commands as you suggested.

2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
   mentions we'd like to "fix" the issuing of commands through the main
   cmdq and instead like to group by "type", if that "type" is the queue
   type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
   what do we plan to do differently there, given that the op is only
   for trapped commands that *have* to go through the main CMDQ?

   If we were planning to do something based on the queue type, I was
   hoping for it to be addressed in this series as we've introduced the
   Tegra CMDQ type. 

That's all I wanted to say, sorry for if this was confusing.

> 
> Nicolin

Thanks,
Praan
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Nicolin Chen 3 months, 1 week ago
On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> Thus, coming back to the two initial points:
> 
> 1) Issuing "non-invalidation" commands through .cache_invalidate could
>    be confusing, I'm not asking to change the op name here, but if we
>    plan to label it, let's label them as "Trapped commands" OR
>    "non-accelerated" commands as you suggested.

VCMDQ only accelerates limited invalidation commands, not all of
them: STE cache invalidation and CD cache invalidation commands
still go down to that op.

> 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
>    mentions we'd like to "fix" the issuing of commands through the main
>    cmdq and instead like to group by "type", if that "type" is the queue
>    type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),

I recall that FIXME was noted by Jason at that time. And it should
be interpreted as "group by opcode", IIUIC.

The thing is that for a host kernel that enabled in-kernel VCMDQs,
those trapped user commands can be just issued to the smmu->cmdq
or a vcmdq (picked via the get_secondary_cmdq impl_op).

>    what do we plan to do differently there, given that the op is only
> >    for trapped commands that *have* to go through the main CMDQ?

If we do something differently there, it could just do a one-time
get_secondary_cmdq call to pick an in-kernel vcmdq over smmu->cmdq
to fill in all the trapped commands.
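
Very roughly, that could mirror what arm_smmu_get_cmdq() already does
in the kernel. A sketch only, reusing the variables from the FIXME
snippet quoted earlier (it glosses over building an arm_smmu_cmdq_ent
from the trapped raw commands):

	struct arm_smmu_cmdq *cmdq = NULL;

	if (smmu->impl_ops && smmu->impl_ops->get_secondary_cmdq)
		cmdq = smmu->impl_ops->get_secondary_cmdq(smmu, &ent);
	if (!cmdq)
		cmdq = &smmu->cmdq;

	ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, last->cmd,
					  cur - last, true);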

And this is not related to this series at all.

Nicolin
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months, 1 week ago
On Tue, Jul 01, 2025 at 05:46:06PM -0700, Nicolin Chen wrote:
> On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> > Thus, coming back to the two initial points:
> > 
> > 1) Issuing "non-invalidation" commands through .cache_invalidate could
> >    be confusing, I'm not asking to change the op name here, but if we
> >    plan to label it, let's label them as "Trapped commands" OR
> >    "non-accelerated" commands as you suggested.
> 
> VCMDQ only accelerates limited invalidation commands, not all of
> them: STE cache invalidation and CD cache invalidation commands
> still go down to that op.
> 

Right, I'm just saying the "other" non-accelerated commands that are
NOT invalidations also go down that op. So, if we add a comment, let's 
not call them "non-invalidation" commands.

> > 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
> >    mentions we'd like to "fix" the issuing of commands through the main
> >    cmdq and instead like to group by "type", if that "type" is the queue
> >    type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
> 
> I recall that FIXME was noted by Jason at that time. And it should
> be interpreted as "group by opcode", IIUIC.

I see.. I misunderstood that..

> 
> The thing is that for a host kernel that enabled in-kernel VCMDQs,
> those trapped user commands can be just issued to the smmu->cmdq
> or a vcmdq (picked via the get_secondary_cmdq impl_op).
> 

Ohh.. so maybe some sort of a load balancing thing?

> >    what do we plan to do differently there, given that the op is only
> >    for trapped commands that *have* to go through the main CMDQ?
> 
> If we do something differently there, it could just do a one-time
> get_secondary_cmdq call to pick an in-kernel vcmdq over smmu->cmdq
> to fill in all the trapped commands.
> 

Alright.

> And this is not related to this series at all.

Agreed, sorry for the confusion then.. I thought that the "type" meant
the queue type.. I guess it's all done then. I have no further questions

Thanks for the clarification!

> 
> Nicolin

Praan
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Jason Gunthorpe 3 months, 1 week ago
On Wed, Jul 02, 2025 at 01:38:33AM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 05:46:06PM -0700, Nicolin Chen wrote:
> > On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> > > Thus, coming back to the two initial points:
> > > 
> > > 1) Issuing "non-invalidation" commands through .cache_invalidate could
> > >    be confusing, I'm not asking to change the op name here, but if we
> > >    plan to label it, let's label them as "Trapped commands" OR
> > >    "non-accelerated" commands as you suggested.
> > 
> > VCMDQ only accelerates limited invalidation commands, not all of
> > them: STE cache invalidation and CD cache invalidation commands
> > still go down to that op.
> > 
> 
> Right, I'm just saying the "other" non-accelerated commands that are
> NOT invalidations also go down that op. So, if we add a comment, let's 
> not call them "non-invalidation" commands.

There are no non-invalidation commands:

static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu,
				      struct arm_vsmmu_invalidation_cmd *cmd)
{
	switch (cmd->cmd[0] & CMDQ_0_OP) {
	case CMDQ_OP_TLBI_NSNH_ALL:
	case CMDQ_OP_TLBI_NH_VA:
	case CMDQ_OP_TLBI_NH_VAA:
	case CMDQ_OP_TLBI_NH_ALL:
	case CMDQ_OP_TLBI_NH_ASID:
	case CMDQ_OP_ATC_INV:
	case CMDQ_OP_CFGI_CD:
	case CMDQ_OP_CFGI_CD_ALL:

Those are only invalidations.

CD invalidation can't go through the vCMDQ path.

> > > 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
> > >    mentions we'd like to "fix" the issuing of commands through the main
> > >    cmdq and instead like to group by "type", if that "type" is the queue
> > >    type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
> > 
> > I recall that FIXME was noted by Jason at that time. And it should
> > be interpreted as "group by opcode", IIUIC.
> 
> I see.. I misunderstood that..

Yes, we could use the vCMDQ in the SMMU driver for invalidations which
would give some minor locking advantage. But it is not really
important to anyone.
 
> > The thing is that for a host kernel that enabled in-kernel VCMDQs,
> > those trapped user commands can be just issued to the smmu->cmdq
> > or a vcmdq (picked via the get_secondary_cmdq impl_op).
> 
> Ohh.. so maybe some sort of a load balancing thing?

The goal of the SMMU driver when it detects CMDQV support is to route
all supported invalidations to CMDQV queues and then balance those
queues across CPUs to reduce lock contention.
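
In the in-kernel path that balancing is a simple per-CPU pick, roughly
(paraphrased from tegra241_cmdqv_get_cmdq(); field names to the best of
my reading of the driver):

	u16 qidx = smp_processor_id() % cmdqv->num_lvcmdqs_per_vintf;
	struct tegra241_vcmdq *vcmdq = vintf->lvcmdqs[qidx];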

Jason
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months ago
On Wed, Jul 02, 2025 at 03:05:41PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 02, 2025 at 01:38:33AM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 05:46:06PM -0700, Nicolin Chen wrote:
> > > On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> > > > Thus, coming back to the two initial points:
> > > > 
> > > > 1) Issuing "non-invalidation" commands through .cache_invalidate could
> > > >    be confusing, I'm not asking to change the op name here, but if we
> > > >    plan to label it, let's label them as "Trapped commands" OR
> > > >    "non-accelerated" commands as you suggested.
> > > 
> > > VCMDQ only accelerates limited invalidation commands, not all of
> > > them: STE cache invalidation and CD cache invalidation commands
> > > still go down to that op.
> > > 
> > 
> > Right, I'm just saying the "other" non-accelerated commands that are
> > NOT invalidations also go down that op. So, if we add a comment, let's 
> > not call them "non-invalidation" commands.
> 
> There are no non-invalidation commands:
> 
> static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu,
> 				      struct arm_vsmmu_invalidation_cmd *cmd)
> {
> 	switch (cmd->cmd[0] & CMDQ_0_OP) {
> 	case CMDQ_OP_TLBI_NSNH_ALL:
> 	case CMDQ_OP_TLBI_NH_VA:
> 	case CMDQ_OP_TLBI_NH_VAA:
> 	case CMDQ_OP_TLBI_NH_ALL:
> 	case CMDQ_OP_TLBI_NH_ASID:
> 	case CMDQ_OP_ATC_INV:
> 	case CMDQ_OP_CFGI_CD:
> 	case CMDQ_OP_CFGI_CD_ALL:
> 
> Those are only invalidations.
> 
> CD invalidation can't go through the vCMDQ path.
> 

Right.. I was however hoping we'd also trap commands like CMD_PRI_RESP
and CMD_RESUME...I'm not sure if they should be accelerated via CMDQV..
I guess I'll need to look and understand a little more if they are..

> > > > 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
> > > >    mentions we'd like to "fix" the issuing of commands through the main
> > > >    cmdq and instead like to group by "type", if that "type" is the queue
> > > >    type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
> > > 
> > > I recall that FIXME was noted by Jason at that time. And it should
> > > be interpreted as "group by opcode", IIUIC.
> > 
> > I see.. I misunderstood that..
> 
> Yes, we could use the vCMDQ in the SMMU driver for invalidations which
> would give some minor locking advantage. But it is not really
> important to anyone.
> 

Alright, I see. Makes sense. Thanks for the clarification.

> > > The thing is that for a host kernel that enabled in-kernel VCMDQs,
> > > those trapped user commands can be just issued to the smmu->cmdq
> > > or a vcmdq (picked via the get_secondary_cmdq impl_op).
> > 
> > Ohh.. so maybe some sort of a load balancing thing?
> 
> The goal of the SMMU driver when it detects CMDQV support is to route
> all supported invalidations to CMDQV queues and then balance those
> queues across CPUs to reduce lock contention.
> 

I see.. that makes sense.. so it's a relatively small gain (but a nice
one). Thanks for clarifying!

> Jason

Praan
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Jason Gunthorpe 3 months ago
On Thu, Jul 03, 2025 at 02:46:03PM +0000, Pranjal Shrivastava wrote:

> Right.. I was however hoping we'd also trap commands like CMD_PRI_RESP
> and CMD_RESUME...I'm not sure if they should be accelerated via CMDQV..
> I guess I'll need to look and understand a little more if they are..

Right now these commands are not supported by vSMMUv3 in Linux.

They probably should be trapped, but completing a PRI (or resuming a
stall which we will treat the same) will go through the PRI/page fault
logic in iommufd, not the cache invalidate.

> > The goal of the SMMU driver when it detects CMDQV support is to route
> > all supported invalidations to CMDQV queues and then balance those
> > queues across CPUs to reduce lock contention.
> 
> I see.. that makes sense.. so it's a relatively small gain (but a nice
> one). Thanks for clarifying!

On bare metal the gain is small (due to locking and balancing), while
on virtualization the gain is huge (due to no trapping).

Regardless, the SMMU driver uses cmdqv support if the HW says it is
there.

Jason
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months ago
On Thu, Jul 03, 2025 at 02:55:32PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 03, 2025 at 02:46:03PM +0000, Pranjal Shrivastava wrote:
> 
> > Right.. I was however hoping we'd also trap commands like CMD_PRI_RESP
> > and CMD_RESUME...I'm not sure if they should be accelerated via CMDQV..
> > I guess I'll need to look and understand a little more if they are..
> 
> Right now these commands are not supported by vSMMUv3 in Linux.
> 
> They probably should be trapped, but completing a PRI (or resuming a
> stall which we will treat the same) will go through the PRI/page fault
> logic in iommufd, not the cache invalidate.
> 

Ahh, thanks for this, that saved a lot of my time! And yes, I see some
functions in eventq.c calling the iopf_group_response which settles the
CMD_RESUME. So.. I assume these resume commands would be trapped and
*actually* executed through this or a similar path for vPRI. 

Meh, I had been putting off reading up on the fault parts of iommufd,
I guess I'll go through that too, now :) 

> > > The goal of the SMMU driver when it detects CMDQV support is to route
> > > all supported invalidations to CMDQV queues and then balance those
> > > queues across CPUs to reduce lock contention.
> > 
> > I see.. that makes sense.. so it's a relatively small gain (but a nice
> > one). Thanks for clarifying!
> 
> On bare metal the gain is small (due to locking and balancing), while
> on virtualization the gain is huge (due to no trapping).
> 

Ohh yes, I meant the bare metal gains here.. for virtualization, it's
definitely huge (as reported too).

> Regardless, the SMMU driver uses cmdqv support if the HW says it is
> there.
>
> Jason
>

Thanks!
Praan
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Jason Gunthorpe 3 months ago
On Thu, Jul 03, 2025 at 06:48:42PM +0000, Pranjal Shrivastava wrote:

> Ahh, thanks for this, that saved a lot of my time! And yes, I see some
> functions in eventq.c calling the iopf_group_response which settles the
> CMD_RESUME. So.. I assume these resume commands would be trapped and
> *actually* executed through this or a similar path for vPRI. 

Yes, that is what Intel did. PRI has to be tracked in the kernel
because we have to ack requests eventually. If the VMM crashes the
kernel has to ack everything and try to clean up.

Also SMMUv3 does not support PRI today, just stall.

Jason
Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
Posted by Pranjal Shrivastava 3 months ago
On Fri, Jul 04, 2025 at 09:50:12AM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 03, 2025 at 06:48:42PM +0000, Pranjal Shrivastava wrote:
> 
> > Ahh, thanks for this, that saved a lot of my time! And yes, I see some
> > functions in eventq.c calling the iopf_group_response which settles the
> > CMD_RESUME. So.. I assume these resume commands would be trapped and
> > *actually* executed through this or a similar path for vPRI. 
> 
> Yes, that is what Intel did. PRI has to be tracked in the kernel
> because we have to ack requests eventually. If the VMM crashes the
> kernel has to ack everything and try to clean up.
> 

I see.. thanks for clarifying!

> Also SMMUv3 does not support PRI today, just stall.
> 

Ack. Thanks!

> Jason
Praan