[PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency

Posted by Jacob Pan 3 months, 2 weeks ago
From: Alexander Grest <Alexander.Grest@microsoft.com>

The SMMU CMDQ lock is highly contentious when there are multiple CPUs
issuing commands on an architecture with small queue sizes, e.g. 256
entries.

The lock has the following states:
 - 0:		Unlocked
 - >0:		Shared lock held with count
 - INT_MIN+N:	Exclusive lock held, where N is the # of shared waiters
 - INT_MIN:	Exclusive lock held, no shared waiters
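
For illustration only, here is a minimal userspace model of this state
encoding using C11 atomics. The struct and function names are invented
for the sketch, the shared-lock path is shown as it behaves with this
patch applied, and the real driver uses the kernel atomic_t API with
its own memory-ordering choices:

  #include <limits.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  /* Toy model of the CMDQ lock word, not the driver code itself. */
  struct cmdq_lock_model {
  	atomic_int lock;	/* 0, >0, INT_MIN or INT_MIN + N as above */
  };

  /*
  Shared lock: bump the count. A negative previous value means an
  exclusive holder is active, so wait for the sign bit to clear;
  our increment is already accounted for in the counter.
  */
  void model_shared_lock(struct cmdq_lock_model *l)
  {
  	if (atomic_fetch_add_explicit(&l->lock, 1,
  				      memory_order_acquire) >= 0)
  		return;
  	while (atomic_load_explicit(&l->lock, memory_order_acquire) <= 0)
  		; /* spin until the exclusive holder releases */
  }

  /* Exclusive trylock: only succeeds from the fully unlocked state (0). */
  bool model_exclusive_trylock(struct cmdq_lock_model *l)
  {
  	int expected = 0;

  	return atomic_compare_exchange_strong_explicit(&l->lock, &expected,
  			INT_MIN, memory_order_acquire, memory_order_relaxed);
  }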

When multiple CPUs are polling for space in the queue, they attempt to
grab the exclusive lock to update the cons pointer from the hardware. If
they fail to get the lock, they spin until the cons pointer is updated
by another CPU.

The current code allows the possibility of shared lock starvation
if there is a constant stream of CPUs trying to grab the exclusive lock.
This leads to severe latency issues and soft lockups.

To mitigate this, release the exclusive lock by clearing only the sign
bit while retaining the shared lock waiter count, so that the shared
lock waiters are not starved.

Also delete the cmpxchg() loop in the shared lock acquisition path, as
it is no longer needed: the waiters can see the positive lock count and
proceed immediately after the exclusive lock is released.

The exclusive lock is not starved in turn, because submitters try the
exclusive lock first when new space becomes available.
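
As a standalone sanity check of the new release semantics (a hedged
sketch in C11 atomics, not the kernel macro itself; the values are
arbitrary):

  #include <assert.h>
  #include <limits.h>
  #include <stdatomic.h>
  #include <stdio.h>

  int main(void)
  {
  	/* Exclusive lock held (sign bit set), three shared waiters queued. */
  	atomic_int lock = INT_MIN + 3;

  	/*
  	New-style release: strip only the sign bit. The waiters'
  	increments survive, so their "value > 0" spin condition is
  	immediately true and they proceed without another cmpxchg.
  	*/
  	atomic_fetch_and_explicit(&lock, ~INT_MIN, memory_order_release);
  	assert(atomic_load(&lock) == 3);

  	/*
  	The old release (setting the counter back to 0) would have
  	discarded those increments, letting another exclusive trylock
  	cut in ahead of the waiters.
  	*/
  	printf("lock after exclusive release: %d\n", atomic_load(&lock));
  	return 0;
  }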

In a staged test where 32 CPUs issue SVA invalidations simultaneously
on a system with a 256-entry queue, the madvise(MADV_DONTNEED) latency
dropped by 50% with this patch, with no soft lockups observed.

Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
v2:
	- Changed shared lock acquire condition from VAL>=0 to VAL>0
	  (Mostafa)
	- Added more comments to explain shared lock change (Nicolin)
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 31 ++++++++++++++-------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6959d99c74a3..9e632bb022fe 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -460,20 +460,26 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
  */
 static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
 {
-	int val;
-
 	/*
-	 * We can try to avoid the cmpxchg() loop by simply incrementing the
-	 * lock counter. When held in exclusive state, the lock counter is set
-	 * to INT_MIN so these increments won't hurt as the value will remain
-	 * negative.
+	 * When held in exclusive state, the lock counter is set to INT_MIN
+	 * so these increments won't hurt as the value will remain negative.
+	 * The increment will also signal the exclusive locker that there are
+	 * shared waiters.
 	 */
 	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
 		return;
 
-	do {
-		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+	/*
+	 * Someone else is holding the lock in exclusive state, so wait
+	 * for them to finish. Since we already incremented the lock counter,
+	 * no exclusive lock can be acquired until we finish. We don't need
+	 * the return value since we only care that the exclusive lock is
+	 * released (i.e. the lock counter is non-negative).
+	 * Once the exclusive locker releases the lock, the sign bit will
+	 * be cleared and our increment will make the lock counter positive,
+	 * allowing us to proceed.
+	 */
+	atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -500,9 +506,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
 	__ret;								\
 })
 
+/*
+ * Only clear the sign bit when releasing the exclusive lock this will
+ * allow any shared_lock() waiters to proceed without the possibility
+ * of entering the exclusive lock in a tight loop.
+ */
 #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags)		\
 ({									\
-	atomic_set_release(&cmdq->lock, 0);				\
+	atomic_fetch_and_release(~INT_MIN, &cmdq->lock);				\
 	local_irq_restore(flags);					\
 })
 
-- 
2.43.0
Re: [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
Posted by Nicolin Chen 3 months, 1 week ago
On Mon, Oct 20, 2025 at 03:43:53PM -0700, Jacob Pan wrote:
> From: Alexander Grest <Alexander.Grest@microsoft.com>
> 
> The SMMU CMDQ lock is highly contentious when there are multiple CPUs
> issuing commands on an architecture with small queue sizes e.g 256
> entries.

As Robin pointed out, a 256-entry queue itself is not quite normal, so
the justification here might still not be very convincing..

I'd suggest avoiding saying "an architecture with small queue sizes",
and instead focusing on the issue itself -- potential starvation.
"256-entry" can be used as a testing setup to reproduce the issue.

> The lock has the following states:
>  - 0:		Unlocked
>  - >0:		Shared lock held with count
>  - INT_MIN+N:	Exclusive lock held, where N is the # of shared waiters
>  - INT_MIN:	Exclusive lock held, no shared waiters
> 
> When multiple CPUs are polling for space in the queue, they attempt to
> grab the exclusive lock to update the cons pointer from the hardware. If
> they fail to get the lock, they will spin until either the cons pointer
> is updated by another CPU.
> 
> The current code allows the possibility of shared lock starvation
> if there is a constant stream of CPUs trying to grab the exclusive lock.
> This leads to severe latency issues and soft lockups.

It'd be nicer to have a graph to show how the starvation might
happen due to a race:

CPU0 (exclusive)  | CPU1 (shared)     | CPU2 (exclusive)    | `cmdq->lock`
--------------------------------------------------------------------------
trylock() //takes |                   |                     | 0
                  | shared_lock()     |                     | INT_MIN
                  | fetch_inc()       |                     | INT_MIN
                  | no return         |                     | INT_MIN + 1
                  | spins // VAL >= 0 |                     | INT_MIN + 1
unlock()          | spins...          |                     | INT_MIN + 1
set_release(0)    | spins...          |                     | 0  <-- BUG?
(done)            | (sees 0)          | trylock() // takes  | 0
                  | *exits loop*      | cmpxchg(0, INT_MIN) | 0
                  |                   | *cuts in*           | INT_MIN
                  | cmpxchg(0, 1)     |                     | INT_MIN
                  | fails // != 0     |                     | INT_MIN
                  | spins // VAL >= 0 |                     | INT_MIN
                  | *starved*         |                     | INT_MIN

And point out that it should have reserved the "+1" from CPU1
instead of nuking the entire cmdq->lock to 0.

> In a staged test where 32 CPUs issue SVA invalidations simultaneously on
> a system with a 256 entry queue, the madvise (MADV_DONTNEED) latency
> dropped by 50% with this patch and without soft lockups.

This might not be very useful per Robin's remarks. I'd drop it.

> Reviewed-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

> @@ -500,9 +506,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
>  	__ret;								\
>  })
>  
> +/*
> + * Only clear the sign bit when releasing the exclusive lock this will
> + * allow any shared_lock() waiters to proceed without the possibility
> + * of entering the exclusive lock in a tight loop.
> + */
>  #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags)		\
>  ({									\
> -	atomic_set_release(&cmdq->lock, 0);				\
> +	atomic_fetch_and_release(~INT_MIN, &cmdq->lock);				\

Align the trailing spacing with other lines please.

Nicolin
Re: [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
Posted by Jacob Pan 3 months ago
Hi Nicolin,

On Thu, 30 Oct 2025 19:00:02 -0700
Nicolin Chen <nicolinc@nvidia.com> wrote:

> On Mon, Oct 20, 2025 at 03:43:53PM -0700, Jacob Pan wrote:
> > From: Alexander Grest <Alexander.Grest@microsoft.com>
> > 
> > The SMMU CMDQ lock is highly contentious when there are multiple
> > CPUs issuing commands on an architecture with small queue sizes e.g
> > 256 entries.  
> 
> As Robin pointed out that 256 entry itself is not quite normal,
> the justification here might still not be very convincing..
> 
> I'd suggest to avoid saying "an architecture with a small queue
> sizes, but to focus on the issue itself -- potential starvation.
> "256-entry" can be used a testing setup to reproduce the issue.
> 
> > The lock has the following states:
> >  - 0:		Unlocked  
> >  - >0:		Shared lock held with count  
> >  - INT_MIN+N:	Exclusive lock held, where N is the # of shared waiters
> >  - INT_MIN:	Exclusive lock held, no shared waiters
> > 
> > When multiple CPUs are polling for space in the queue, they attempt
> > to grab the exclusive lock to update the cons pointer from the
> > hardware. If they fail to get the lock, they will spin until either
> > the cons pointer is updated by another CPU.
> > 
> > The current code allows the possibility of shared lock starvation
> > if there is a constant stream of CPUs trying to grab the exclusive
> > lock. This leads to severe latency issues and soft lockups.  
> 
> It'd be nicer to have a graph to show how the starvation might
> happen due to a race:
> 
> CPU0 (exclusive)  | CPU1 (shared)     | CPU2 (exclusive)    | `cmdq->lock`
> --------------------------------------------------------------------------
> trylock() //takes |                   |                     | 0
>                   | shared_lock()     |                     | INT_MIN
>                   | fetch_inc()       |                     | INT_MIN
>                   | no return         |                     | INT_MIN + 1
>                   | spins // VAL >= 0 |                     | INT_MIN + 1
> unlock()          | spins...          |                     | INT_MIN + 1
> set_release(0)    | spins...          |                     | 0  <-- BUG?
Not sure we can call it a bug, but it definitely opens the door to
starving the shared lock.

> (done)            | (sees 0)          | trylock() // takes  | 0
>                   | *exits loop*      | cmpxchg(0, INT_MIN) | 0
>                   |                   | *cuts in*           | INT_MIN
>                   | cmpxchg(0, 1)     |                     | INT_MIN
>                   | fails // != 0     |                     | INT_MIN
>                   | spins // VAL >= 0 |                     | INT_MIN
>                   | *starved*         |                     | INT_MIN
>
Thanks for the graph, will incorporate. The starved shared lock also
prevents advancing the cmdq, which perpetuates the
!queue_has_space(&llq, n + sync) situation.
 
> And point it out that it should have reserved the "+1" from CPU1
> instead of nuking the entire cmdq->lock to 0.
> 
Will do. Reserving the "+1" is useful to prevent back-to-back exclusive
lock acquisition; nuking the counter to 0 throws that information away.

> > In a staged test where 32 CPUs issue SVA invalidations
> > simultaneously on a system with a 256 entry queue, the madvise
> > (MADV_DONTNEED) latency dropped by 50% with this patch and without
> > soft lockups.  
> 
> This might not be very useful per Robin's remarks. I'd drop it.
> 
Will do.

> > Reviewed-by: Mostafa Saleh <smostafa@google.com>
> > Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
> > Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>  
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> 
> > @@ -500,9 +506,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
> >  	__ret;								\
> >  })
> >  
> > +/*
> > + * Only clear the sign bit when releasing the exclusive lock this will
> > + * allow any shared_lock() waiters to proceed without the possibility
> > + * of entering the exclusive lock in a tight loop.
> > + */
> >  #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags)		\
> >  ({									\
> > -	atomic_set_release(&cmdq->lock, 0);				\
> > +	atomic_fetch_and_release(~INT_MIN, &cmdq->lock);				\
> 
> Align the tailing spacing with other lines please.
> 
> Nicolin