From: Ionut Nechita <ionut.nechita@windriver.com>
Commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.
On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
completions concurrently, they contend on queue_lock in the hot path,
causing all IRQ threads to enter D (uninterruptible sleep) state. This
serializes interrupt processing completely.
Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
- Good (v6.6.52-rt): 640 MB/s sequential read
- Bad (v6.6.64-rt): 153 MB/s sequential read (-76% regression)
- 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock
The original commit message mentioned memory barriers as an alternative
approach. Use full memory barriers (smp_mb()) instead of queue_lock to
provide the same ordering guarantees without sleeping on RT kernels.
Memory barriers ensure proper synchronization:
- CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
- CPU1 sees dispatch list/sw queue bitmap updates
This maintains correctness while avoiding lock contention that causes
RT kernel IRQ threads to sleep in the I/O completion path.
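In store-buffering terms, the ordering relied on is roughly the following
(an illustrative sketch, not part of the diff below; the write-side
barrier is assumed to come from the insert/unquiesce paths):

    CPU0 (insert + run hw queue)            CPU1 (unquiesce)
    ----------------------------            ----------------------------
    add req to dispatch list/sw queue       clear QUEUE_FLAG_QUIESCED
    smp_mb()                                smp_mb()  /* assumed pairing */
    load QUEUE_FLAG_QUIESCED                load dispatch list/sw queue bitmap

With full barriers on both sides, at least one of the two loads observes
the other CPU's store, so a newly added request cannot be stranded with
neither CPU rerunning the hw queue.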
Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable@vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
block/blk-mq.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5da948b07058..5fb8da4958d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
+ /*
+ * First lockless check to avoid unnecessary overhead.
+ * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
+ */
need_run = blk_mq_hw_queue_need_run(hctx);
if (!need_run) {
- unsigned long flags;
-
- /*
- * Synchronize with blk_mq_unquiesce_queue(), because we check
- * if hw queue is quiesced locklessly above, we need the use
- * ->queue_lock to make sure we see the up-to-date status to
- * not miss rerunning the hw queue.
- */
- spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+ /* Synchronize with blk_mq_unquiesce_queue() */
+ smp_mb();
need_run = blk_mq_hw_queue_need_run(hctx);
- spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
-
if (!need_run)
return;
+ /* Ensure dispatch list/sw queue updates visible before execution */
+ smp_mb();
}
if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
--
2.52.0
On 2025/12/23 04:15, Ionut Nechita (WindRiver) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> Commit 679b1874eba7 ("block: fix ordering between checking
> QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
> in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.
>
> On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
> rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
> completions concurrently, they contend on queue_lock in the hot path,
> causing all IRQ threads to enter D (uninterruptible sleep) state. This
> serializes interrupt processing completely.
>
> Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
> - Good (v6.6.52-rt): 640 MB/s sequential read
> - Bad (v6.6.64-rt): 153 MB/s sequential read (-76% regression)
> - 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock
>
> The original commit message mentioned memory barriers as an alternative
> approach. Use full memory barriers (smp_mb()) instead of queue_lock to
> provide the same ordering guarantees without sleeping on RT kernels.
>
> Memory barriers ensure proper synchronization:
> - CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
> - CPU1 sees dispatch list/sw queue bitmap updates
>
> This maintains correctness while avoiding lock contention that causes
> RT kernel IRQ threads to sleep in the I/O completion path.
>
> Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
> block/blk-mq.c | 19 ++++++++-----------
> 1 file changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5da948b07058..5fb8da4958d0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
>
> might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
>
> + /*
> + * First lockless check to avoid unnecessary overhead.
> + * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
> + */
> need_run = blk_mq_hw_queue_need_run(hctx);
> if (!need_run) {
> - unsigned long flags;
> -
> - /*
> - * Synchronize with blk_mq_unquiesce_queue(), because we check
> - * if hw queue is quiesced locklessly above, we need the use
> - * ->queue_lock to make sure we see the up-to-date status to
> - * not miss rerunning the hw queue.
> - */
> - spin_lock_irqsave(&hctx->queue->queue_lock, flags);
> + /* Synchronize with blk_mq_unquiesce_queue() */
Memory barriers must be used in pairs. So how to synchronize?
> + smp_mb();
> need_run = blk_mq_hw_queue_need_run(hctx);
> - spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
> -
> if (!need_run)
> return;
> + /* Ensure dispatch list/sw queue updates visible before execution */
> + smp_mb();
Why do we need another barrier? What order does this barrier guarantee?
Thanks.
> }
>
> if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
From: Ionut Nechita <ionut.nechita@windriver.com>

Hi Muchun,

Thank you for the detailed review. Your questions about the memory
barriers are absolutely correct and highlight fundamental issues with my
approach.

> Memory barriers must be used in pairs. So how to synchronize?
> Why do we need another barrier? What order does this barrier guarantee?

You're right to ask these questions. After careful consideration and
discussion with Ming Lei, I've concluded that the memory barrier approach
in this patch is flawed and insufficient.

The fundamental problem is:

1. Memory barriers need proper pairing on both read and write sides.
2. The write-side barriers would need to be inserted at MULTIPLE call
   sites throughout the block layer - everywhere work is added before
   calling blk_mq_run_hw_queue().
3. This is exactly why the original commit 679b1874eba7 chose the
   lock-based approach, noting that "memory barrier is not easy to be
   maintained".

My patch attempted to add barriers only in blk_mq_run_hw_queue(), but
didn't address the pairing barriers needed at all the call sites that add
work to dispatch lists/sw queues. This makes the synchronization
incomplete.

## New approach: dedicated raw_spinlock

I'm abandoning the memory barrier approach and preparing a new patch that
uses a dedicated raw_spinlock_t (quiesce_sync_lock) instead of the
general-purpose queue_lock.

The key differences from the current problematic code:

- Current: uses queue_lock (spinlock_t), which becomes an rt_mutex on RT
  kernels
- New: uses quiesce_sync_lock (raw_spinlock_t), which stays a real
  spinlock

Why raw_spinlock is safe:

- The critical section is provably short (only flag and counter checks)
- No sleeping operations under the lock
- Specific to quiesce synchronization, not general queue operations

This approach:

- Maintains the correct synchronization from 679b1874eba7
- Avoids sleeping in the RT kernel's IRQ thread context
- Is simpler and more maintainable than memory barrier pairing across
  many call sites

I'll send the new patch shortly. Thank you for catching these issues
before they made it into the kernel.

Best regards,
Ionut
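For reference, a minimal sketch of the quiesce_sync_lock idea described
above (illustrative only; the field name, its placement in struct
request_queue, and the exact call sites are assumptions, and the actual
patch may differ):

    /*
     * struct request_queue would gain a new field, e.g. initialized
     * alongside queue_lock in blk_alloc_queue():
     *
     *      raw_spinlock_t          quiesce_sync_lock;
     *
     * blk_mq_run_hw_queue() would then re-check under it; a
     * raw_spinlock_t keeps spinning even with CONFIG_PREEMPT_RT.
     */
    need_run = blk_mq_hw_queue_need_run(hctx);
    if (!need_run) {
            unsigned long flags;

            /* Synchronize with blk_mq_unquiesce_queue() */
            raw_spin_lock_irqsave(&hctx->queue->quiesce_sync_lock, flags);
            need_run = blk_mq_hw_queue_need_run(hctx);
            raw_spin_unlock_irqrestore(&hctx->queue->quiesce_sync_lock, flags);

            if (!need_run)
                    return;
    }

For the pairing to match what 679b1874eba7's queue_lock provides, the
quiesce/unquiesce side (blk_mq_quiesce_queue_nowait() /
blk_mq_unquiesce_queue()) would also have to take quiesce_sync_lock
around the QUEUE_FLAG_QUIESCED update.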