While polling for n spaces in the cmdq, the current code instead checks
whether the queue is full. If the queue is almost full but does not
have enough space (< n), the CMDQ timeout warning is never triggered
even when the polling has exceeded the timeout limit.
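To see the gap concretely, here is a standalone user-space sketch (not
driver code; SHIFT, IDX(), WRP(), and the example values are invented
stand-ins for the driver's queue index math). It builds a state with
one free slot: "full" is false, yet there is no room for two commands,
so a full-based poll returns immediately and the timeout never fires:

#include <stdbool.h>
#include <stdio.h>

#define SHIFT 8					/* 256-entry queue */
#define IDX(p) ((p) & ((1u << SHIFT) - 1))	/* index bits */
#define WRP(p) ((p) & (1u << SHIFT))		/* wrap bit */

static bool has_space(unsigned int prod, unsigned int cons, unsigned int n)
{
	unsigned int space;

	if (WRP(prod) == WRP(cons))
		space = (1u << SHIFT) - (IDX(prod) - IDX(cons));
	else
		space = IDX(cons) - IDX(prod);

	return space >= n;
}

static bool full(unsigned int prod, unsigned int cons)
{
	return IDX(prod) == IDX(cons) && WRP(prod) != WRP(cons);
}

int main(void)
{
	unsigned int cons = 255;		 /* wrap 0, index 255 */
	unsigned int prod = (1u << SHIFT) | 254; /* wrap 1, index 254 */

	/* Prints "full=0 has_space(n=2)=0": not full, but no space */
	printf("full=%d has_space(n=2)=%d\n",
	       full(prod, cons), has_space(prod, cons, 2));
	return 0;
}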
The existing arm_smmu_cmdq_poll_until_not_full() is neither an
efficient nor an ideal fit for its only caller,
arm_smmu_cmdq_issue_cmdlist():
- It starts a new timer on every single call, which fails to limit the
  total wait to the preset ARM_SMMU_POLL_TIMEOUT_US per issue.
- It has a redundant internal queue_full() check, which does not detect
  whether there is enough space for n commands.
This patch polls for the availability of the exact space needed,
instead of the queue-full condition, and emits the timeout warning
accordingly.
Fixes: 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion")
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
v5:
- Disable WFE for queue space polling (Robin, Will)
v4:
- Deleted non-ETIMEDOUT error handling for queue_poll (Nicolin)
v3:
- Use a helper for cmdq poll instead of open coding (Nicolin)
- Add more explanation in the commit message (Nicolin)
v2:
- Reduced debug print info (Nicolin)
- Use separate irq flags for the exclusive lock
- Handle queue_poll error codes other than ETIMEDOUT
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 49 ++++++++++-----------
1 file changed, 24 insertions(+), 25 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d637a5dcf48a..3467c10be0d0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -117,12 +117,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
return space >= n;
}
-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
- return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
- Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
-}
-
static bool queue_empty(struct arm_smmu_ll_queue *q)
{
return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
@@ -612,14 +606,13 @@ static void arm_smmu_cmdq_poll_valid_map(struct arm_smmu_cmdq *cmdq,
__arm_smmu_cmdq_poll_set_valid_map(cmdq, sprod, eprod, false);
}
-/* Wait for the command queue to become non-full */
-static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
- struct arm_smmu_cmdq *cmdq,
- struct arm_smmu_ll_queue *llq)
+
+static inline void arm_smmu_cmdq_poll(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq,
+ struct arm_smmu_ll_queue *llq,
+ struct arm_smmu_queue_poll *qp)
{
unsigned long flags;
- struct arm_smmu_queue_poll qp;
- int ret = 0;
/*
* Try to update our copy of cons by grabbing exclusive cmdq access. If
@@ -629,19 +622,16 @@ static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
llq->val = READ_ONCE(cmdq->q.llq.val);
- return 0;
+ return;
}
- queue_poll_init(smmu, &qp);
- do {
- llq->val = READ_ONCE(cmdq->q.llq.val);
- if (!queue_full(llq))
- break;
-
- ret = queue_poll(&qp);
- } while (!ret);
-
- return ret;
+ if (queue_poll(qp) == -ETIMEDOUT) {
+ dev_err_ratelimited(smmu->dev, "CMDQ timed out, cons: 0x%08x, prod: 0x%08x\n",
+ llq->cons, llq->prod);
+ /* Restart the timer */
+ queue_poll_init(smmu, qp, false);
+ }
+ llq->val = READ_ONCE(cmdq->q.llq.val);
}
/*
@@ -781,12 +771,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
local_irq_save(flags);
llq.val = READ_ONCE(cmdq->q.llq.val);
do {
+ struct arm_smmu_queue_poll qp;
u64 old;
+ /*
+ * Poll without WFE because:
+ * 1) Running out of space should be rare. Power saving is not
+ * an issue.
+ * 2) WFE depends on queue full break events, which occur only
+ * when the queue is full, but here we're polling for
+ * sufficient space, not just queue full condition.
+ */
+ queue_poll_init(smmu, &qp, false);
while (!queue_has_space(&llq, n + sync)) {
local_irq_restore(flags);
- if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
- dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
+ arm_smmu_cmdq_poll(smmu, cmdq, &llq, &qp);
local_irq_save(flags);
}
--
2.43.0
On Mon, Dec 08, 2025 at 01:28:56PM -0800, Jacob Pan wrote:
> @@ -781,12 +771,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> local_irq_save(flags);
> llq.val = READ_ONCE(cmdq->q.llq.val);
> do {
> + struct arm_smmu_queue_poll qp;
> u64 old;
>
> + /*
> + * Poll without WFE because:
> + * 1) Running out of space should be rare. Power saving is not
> + * an issue.
> + * 2) WFE depends on queue full break events, which occur only
> + * when the queue is full, but here we're polling for
> + * sufficient space, not just queue full condition.
> + */
I don't think this is reasonable; we should be able to use wfe instead of
polling on hardware that supports it and that is an important power-saving
measure in mobile parts.
If this is really an issue, we could take a spinlock around the
command-queue allocation loop for hardware with small queue sizes relative
to the number of CPUs, but it's not clear to me that we need to do anything
at all. I'm happy with the locking change in patch 3.
If we apply _only_ the locking change in the next patch, does that solve the
reported problem for you?
Will
Hi Will,
On Wed, 10 Dec 2025 12:16:19 +0900
Will Deacon <will@kernel.org> wrote:
> On Mon, Dec 08, 2025 at 01:28:56PM -0800, Jacob Pan wrote:
> > @@ -781,12 +771,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> > local_irq_save(flags);
> > llq.val = READ_ONCE(cmdq->q.llq.val);
> > do {
> > + struct arm_smmu_queue_poll qp;
> > u64 old;
> >
> > + /*
> > + * Poll without WFE because:
> > + * 1) Running out of space should be rare. Power saving is not
> > + * an issue.
> > + * 2) WFE depends on queue full break events, which occur only
> > + * when the queue is full, but here we're polling for
> > + * sufficient space, not just queue full condition.
> > + */
>
> I don't think this is reasonable; we should be able to use wfe
> instead of polling on hardware that supports it and that is an
> important power-saving measure in mobile parts.
>
After an offline discussion, I now understand that WFE essentially
stops the CPU clock, making energy savings almost always beneficial.
This differs from certain C-state or idle-state transitions, where the
energy-saving break-even point depends on how long the CPU remains
idle. Previously, I assumed power savings were not guaranteed due to
the unpredictability of wake events (e.g., timing relative to scheduler
ticks or queue-full conditions).
So I agree we should leverage WFE as much as we can here.
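For context, the wfe decision lives in the driver's generic poll
helper, which looks roughly like this (simplified reproduction from
memory, not part of this series; qp->wfe comes from the SEV-capable
feature detection):

static int queue_poll(struct arm_smmu_queue_poll *qp)
{
	if (ktime_compare(ktime_get(), qp->timeout) > 0)
		return -ETIMEDOUT;

	if (qp->wfe) {
		wfe();			/* sleep until an event wakes the CPU */
	} else if (++qp->spin_cnt < ARM_SMMU_POLL_SPIN_COUNT) {
		cpu_relax();		/* short busy-wait phase */
	} else {
		udelay(qp->delay);	/* then exponential backoff */
		qp->delay *= 2;
		qp->spin_cnt = 0;
	}

	return 0;
}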
> If this is really an issue, we could take a spinlock around the
> command-queue allocation loop for hardware with small queue sizes
> relative to the number of CPUs, but it's not clear to me that we need
> to do anything at all. I'm happy with the locking change in patch 3.
>
> If we apply _only_ the locking change in the next patch, does that
> solve the reported problem for you?
Yes, please take #3. It should take care of the functional problem.
Thanks,
Jacob