[PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by zhuxiaohui 1 month ago
From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>

It is observed that connecting to an NVMe over Fabrics target will
always fail when 'nohz_full' is set.

Commit a46c27026da1 ("blk-mq: don't schedule block kworker on
isolated CPUs") clears hctx->cpumask for all isolated CPUs, so when
nvme connects to a remote target, it may fail with this stack:

        blk_mq_alloc_request_hctx+1
        __nvme_submit_sync_cmd+106
        nvmf_connect_io_queue+181
        nvme_tcp_start_queue+293
        nvme_tcp_setup_ctrl+948
        nvme_tcp_create_ctrl+735
        nvmf_dev_write+532
        vfs_write+237
        ksys_write+107
        do_syscall_64+128
        entry_SYSCALL_64_after_hwframe+118

because the given blk_mq_hw_ctx->cpumask has been cleared, leaving no
available blk_mq_ctx on the hw queue.
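
A minimal reproducer (the CPU range, transport address and NQN below are
examples, not taken from the original report) is to boot with isolated
CPUs and then issue an ordinary fabrics connect:

        # kernel command line
        nohz_full=2-7

        # user space, nvme-cli, any fabrics transport
        nvme connect -t tcp -a 192.168.0.10 -s 4420 -n nqn.2024-10.org.example:testnqn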

This patch introduces a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
as well as an nvme_submit_flags_t flag 'NVME_SUBMIT_ARB_MQ', which
indicate that the block layer may fall back to a blk_mq_ctx whose CPU
is not isolated.

Signed-off-by: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
---
 block/blk-mq.c              | 12 ++++++++++--
 drivers/nvme/host/core.c    |  5 ++++-
 drivers/nvme/host/fabrics.c |  3 ++-
 drivers/nvme/host/nvme.h    |  2 ++
 include/linux/blk-mq.h      |  4 +++-
 5 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index cf626e061dd7..e4e791fd6d80 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -654,8 +654,16 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
-	if (cpu >= nr_cpu_ids)
-		goto out_queue_exit;
+	if (cpu >= nr_cpu_ids) {
+		if (!(flags & BLK_MQ_REQ_ARB_MQ))
+			goto out_queue_exit;
+		/* fallback to the first cpu not isolated */
+		for_each_online_cpu(cpu) {
+			if (!cpu_is_isolated(cpu))
+				break;
+		}
+	}
+
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	if (q->elevator)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d..dbb9cb59e54c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1130,9 +1130,12 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		blk_flags |= BLK_MQ_REQ_RESERVED;
 	if (qid == NVME_QID_ANY)
 		req = blk_mq_alloc_request(q, nvme_req_op(cmd), blk_flags);
-	else
+	else {
+		if (flags & NVME_SUBMIT_ARB_MQ)
+			blk_flags |= BLK_MQ_REQ_ARB_MQ;
 		req = blk_mq_alloc_request_hctx(q, nvme_req_op(cmd), blk_flags,
 						qid - 1);
+	}
 
 	if (IS_ERR(req))
 		return PTR_ERR(req);
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 432efcbf9e2f..ef34958e33c0 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -539,7 +539,8 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
 			data, sizeof(*data), qid,
 			NVME_SUBMIT_AT_HEAD |
 			NVME_SUBMIT_RESERVED |
-			NVME_SUBMIT_NOWAIT);
+			NVME_SUBMIT_NOWAIT |
+			NVME_SUBMIT_ARB_MQ);
 	if (ret) {
 		nvmf_log_connect_error(ctrl, ret, le32_to_cpu(res.u32),
 				       &cmd, data);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 093cb423f536..a61b35b1cd90 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -880,6 +880,8 @@ enum {
 	NVME_SUBMIT_RESERVED = (__force nvme_submit_flags_t)(1 << 2),
 	/* Retry command when NVME_STATUS_DNR is not set in the result */
 	NVME_SUBMIT_RETRY = (__force nvme_submit_flags_t)(1 << 3),
+	/* Submit command with arbitrary mq ctx */
+	NVME_SUBMIT_ARB_MQ = (__force nvme_submit_flags_t)(1 << 4),
 };
 
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4fecf46ef681..d14be341ea4b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -746,6 +746,8 @@ enum {
 	BLK_MQ_REQ_RESERVED	= (__force blk_mq_req_flags_t)(1 << 1),
 	/* set RQF_PM */
 	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 2),
+	/* use arbitrary mq ctx */
+	BLK_MQ_REQ_ARB_MQ	= (__force blk_mq_req_flags_t)(1 << 3),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
@@ -824,7 +826,7 @@ static inline int blk_mq_request_completed(struct request *rq)
 }
 
 /*
- * 
+ *
  * Set the state to complete when completing a request from inside ->queue_rq.
  * This is used by drivers that want to ensure special complete actions that
  * need access to the request are called on failure, e.g. by nvme for
-- 
2.39.5
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Ming Lei 1 month ago
On Sun, Oct 20, 2024 at 10:40:41PM +0800, zhuxiaohui wrote:
> From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
> 
> It is observed that nvme connect to a nvme over fabric target will
> always fail when 'nohz_full' is set.
> 
> In commit a46c27026da1 ("blk-mq: don't schedule block kworker on
> isolated CPUs"), it clears hctx->cpumask for all isolate CPUs,
> and when nvme connect to a remote target, it may fails on this stack:
> 
>         blk_mq_alloc_request_hctx+1
>         __nvme_submit_sync_cmd+106
>         nvmf_connect_io_queue+181
>         nvme_tcp_start_queue+293
>         nvme_tcp_setup_ctrl+948
>         nvme_tcp_create_ctrl+735
>         nvmf_dev_write+532
>         vfs_write+237
>         ksys_write+107
>         do_syscall_64+128
>         entry_SYSCALL_64_after_hwframe+118
> 
> due to that the given blk_mq_hw_ctx->cpumask is cleared with no available
> blk_mq_ctx on the hw queue.
> 
> This patch introduce a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
> as well as a nvme_submit_flags_t 'NVME_SUBMIT_ARB_MQ' which are used to
> indicate that block layer can fallback to a  blk_mq_ctx whose cpu
> is not isolated.

blk_mq_alloc_request_hctx()
	...
	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
	...

It can happen in the non-cpu-isolation case too, such as when this hctx has
no online CPUs; both cases are actually the same from this viewpoint.
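
In both cases the effect is the same: the intersection of hctx->cpumask and
cpu_online_mask is empty, so (simplified sketch, not the verbatim source)
the allocation bails out before any ctx is picked:

	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
	if (cpu >= nr_cpu_ids)		/* no online CPU mapped to this hctx */
		goto out_queue_exit;	/* blk_mq_alloc_request_hctx() fails */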

It is a long-standing problem for nvme-fc.


Thanks,
Ming
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Sagi Grimberg 1 month ago


On 21/10/2024 4:39, Ming Lei wrote:
> On Sun, Oct 20, 2024 at 10:40:41PM +0800, zhuxiaohui wrote:
>> From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
>>
>> It is observed that nvme connect to a nvme over fabric target will
>> always fail when 'nohz_full' is set.
>>
>> In commit a46c27026da1 ("blk-mq: don't schedule block kworker on
>> isolated CPUs"), it clears hctx->cpumask for all isolate CPUs,
>> and when nvme connect to a remote target, it may fails on this stack:
>>
>>          blk_mq_alloc_request_hctx+1
>>          __nvme_submit_sync_cmd+106
>>          nvmf_connect_io_queue+181
>>          nvme_tcp_start_queue+293
>>          nvme_tcp_setup_ctrl+948
>>          nvme_tcp_create_ctrl+735
>>          nvmf_dev_write+532
>>          vfs_write+237
>>          ksys_write+107
>>          do_syscall_64+128
>>          entry_SYSCALL_64_after_hwframe+118
>>
>> due to that the given blk_mq_hw_ctx->cpumask is cleared with no available
>> blk_mq_ctx on the hw queue.
>>
>> This patch introduce a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
>> as well as a nvme_submit_flags_t 'NVME_SUBMIT_ARB_MQ' which are used to
>> indicate that block layer can fallback to a  blk_mq_ctx whose cpu
>> is not isolated.
> blk_mq_alloc_request_hctx()
> 	...
> 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> 	...
>
> It can happen in case of non-cpu-isolation too, such as when this hctx hasn't
> online CPUs, both are same actually from this viewpoint.
>
> It is one long-time problem for nvme fc.

What nvmf is using blk_mq_alloc_request_hctx() for is not important. It
just needs a tag from that hctx; the request execution runs wherever
blk_mq_alloc_request_hctx() is running.
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Ming Lei 1 month ago
On Mon, Oct 21, 2024 at 10:05:34AM +0300, Sagi Grimberg wrote:
> 
> 
> 
> On 21/10/2024 4:39, Ming Lei wrote:
> > On Sun, Oct 20, 2024 at 10:40:41PM +0800, zhuxiaohui wrote:
> > > From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
> > > 
> > > It is observed that nvme connect to a nvme over fabric target will
> > > always fail when 'nohz_full' is set.
> > > 
> > > In commit a46c27026da1 ("blk-mq: don't schedule block kworker on
> > > isolated CPUs"), it clears hctx->cpumask for all isolate CPUs,
> > > and when nvme connect to a remote target, it may fails on this stack:
> > > 
> > >          blk_mq_alloc_request_hctx+1
> > >          __nvme_submit_sync_cmd+106
> > >          nvmf_connect_io_queue+181
> > >          nvme_tcp_start_queue+293
> > >          nvme_tcp_setup_ctrl+948
> > >          nvme_tcp_create_ctrl+735
> > >          nvmf_dev_write+532
> > >          vfs_write+237
> > >          ksys_write+107
> > >          do_syscall_64+128
> > >          entry_SYSCALL_64_after_hwframe+118
> > > 
> > > due to that the given blk_mq_hw_ctx->cpumask is cleared with no available
> > > blk_mq_ctx on the hw queue.
> > > 
> > > This patch introduce a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
> > > as well as a nvme_submit_flags_t 'NVME_SUBMIT_ARB_MQ' which are used to
> > > indicate that block layer can fallback to a  blk_mq_ctx whose cpu
> > > is not isolated.
> > blk_mq_alloc_request_hctx()
> > 	...
> > 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> > 	...
> > 
> > It can happen in case of non-cpu-isolation too, such as when this hctx hasn't
> > online CPUs, both are same actually from this viewpoint.
> > 
> > It is one long-time problem for nvme fc.
> 
> For what nvmf is using blk_mq_alloc_request_hctx() is not important. It just
> needs a tag from that hctx. the request execution is running where
> blk_mq_alloc_request_hctx() is running.

I am afraid that just one tag from the specified hw queue isn't enough.

The connection request needs to be issued to the hw queue and completed.
Without any online CPU for this hw queue, the request can't be completed
in the managed-irq case.


thanks,
Ming
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Sagi Grimberg 1 month ago


On 21/10/2024 11:31, Ming Lei wrote:
> On Mon, Oct 21, 2024 at 10:05:34AM +0300, Sagi Grimberg wrote:
>>
>>
>> On 21/10/2024 4:39, Ming Lei wrote:
>>> On Sun, Oct 20, 2024 at 10:40:41PM +0800, zhuxiaohui wrote:
>>>> From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
>>>>
>>>> It is observed that nvme connect to a nvme over fabric target will
>>>> always fail when 'nohz_full' is set.
>>>>
>>>> In commit a46c27026da1 ("blk-mq: don't schedule block kworker on
>>>> isolated CPUs"), it clears hctx->cpumask for all isolate CPUs,
>>>> and when nvme connect to a remote target, it may fails on this stack:
>>>>
>>>>           blk_mq_alloc_request_hctx+1
>>>>           __nvme_submit_sync_cmd+106
>>>>           nvmf_connect_io_queue+181
>>>>           nvme_tcp_start_queue+293
>>>>           nvme_tcp_setup_ctrl+948
>>>>           nvme_tcp_create_ctrl+735
>>>>           nvmf_dev_write+532
>>>>           vfs_write+237
>>>>           ksys_write+107
>>>>           do_syscall_64+128
>>>>           entry_SYSCALL_64_after_hwframe+118
>>>>
>>>> due to that the given blk_mq_hw_ctx->cpumask is cleared with no available
>>>> blk_mq_ctx on the hw queue.
>>>>
>>>> This patch introduce a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
>>>> as well as a nvme_submit_flags_t 'NVME_SUBMIT_ARB_MQ' which are used to
>>>> indicate that block layer can fallback to a  blk_mq_ctx whose cpu
>>>> is not isolated.
>>> blk_mq_alloc_request_hctx()
>>> 	...
>>> 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
>>> 	...
>>>
>>> It can happen in case of non-cpu-isolation too, such as when this hctx hasn't
>>> online CPUs, both are same actually from this viewpoint.
>>>
>>> It is one long-time problem for nvme fc.
>> For what nvmf is using blk_mq_alloc_request_hctx() is not important. It just
>> needs a tag from that hctx. the request execution is running where
>> blk_mq_alloc_request_hctx() is running.
> I am afraid that just one tag from the specified hw queue isn't enough.
>
> The connection request needs to be issued to the hw queue & completed.
> Without any online CPU for this hw queue, the request can't be completed
> in case of managed-irq.

None of the consumers of this API use managed irqs; the networking stack
takes care of steering irq vectors to online CPUs.
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Ming Lei 1 month ago
On Mon, Oct 21, 2024 at 02:30:01PM +0300, Sagi Grimberg wrote:
> 
> 
> 
> On 21/10/2024 11:31, Ming Lei wrote:
> > On Mon, Oct 21, 2024 at 10:05:34AM +0300, Sagi Grimberg wrote:
> > > 
> > > 
> > > On 21/10/2024 4:39, Ming Lei wrote:
> > > > On Sun, Oct 20, 2024 at 10:40:41PM +0800, zhuxiaohui wrote:
> > > > > From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
> > > > > 
> > > > > It is observed that nvme connect to a nvme over fabric target will
> > > > > always fail when 'nohz_full' is set.
> > > > > 
> > > > > In commit a46c27026da1 ("blk-mq: don't schedule block kworker on
> > > > > isolated CPUs"), it clears hctx->cpumask for all isolate CPUs,
> > > > > and when nvme connect to a remote target, it may fails on this stack:
> > > > > 
> > > > >           blk_mq_alloc_request_hctx+1
> > > > >           __nvme_submit_sync_cmd+106
> > > > >           nvmf_connect_io_queue+181
> > > > >           nvme_tcp_start_queue+293
> > > > >           nvme_tcp_setup_ctrl+948
> > > > >           nvme_tcp_create_ctrl+735
> > > > >           nvmf_dev_write+532
> > > > >           vfs_write+237
> > > > >           ksys_write+107
> > > > >           do_syscall_64+128
> > > > >           entry_SYSCALL_64_after_hwframe+118
> > > > > 
> > > > > due to that the given blk_mq_hw_ctx->cpumask is cleared with no available
> > > > > blk_mq_ctx on the hw queue.
> > > > > 
> > > > > This patch introduce a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
> > > > > as well as a nvme_submit_flags_t 'NVME_SUBMIT_ARB_MQ' which are used to
> > > > > indicate that block layer can fallback to a  blk_mq_ctx whose cpu
> > > > > is not isolated.
> > > > blk_mq_alloc_request_hctx()
> > > > 	...
> > > > 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> > > > 	...
> > > > 
> > > > It can happen in case of non-cpu-isolation too, such as when this hctx hasn't
> > > > online CPUs, both are same actually from this viewpoint.
> > > > 
> > > > It is one long-time problem for nvme fc.
> > > For what nvmf is using blk_mq_alloc_request_hctx() is not important. It just
> > > needs a tag from that hctx. the request execution is running where
> > > blk_mq_alloc_request_hctx() is running.
> > I am afraid that just one tag from the specified hw queue isn't enough.
> > 
> > The connection request needs to be issued to the hw queue & completed.
> > Without any online CPU for this hw queue, the request can't be completed
> > in case of managed-irq.
> 
> None of the consumers of this API use managed-irqs. the networking stack
> takes care of steering irq vectors to online cpus.

OK, it looks unnecessary to AND with cpu_online_mask in
blk_mq_alloc_request_hctx(); that behavior actually comes from commit
20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU").

But it is still too tricky as an API; please look at blk_mq_get_tag(), which may
allocate a tag from another hw queue instead of the specified one.

It is just lucky for the connect request: because I/O hasn't started yet
at that point, the allocation always succeeds on the first try in
__blk_mq_get_tag().
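
Roughly what that looks like inside blk_mq_get_tag() (heavily abridged
paraphrase, not the verbatim source): when the first attempt misses and
BLK_MQ_REQ_NOWAIT is not set, the allocator sleeps and then re-derives
ctx/hctx from whatever CPU the task wakes up on:

	tag = __blk_mq_get_tag(data, bt);
	if (tag != BLK_MQ_NO_TAG)
		goto found_tag;
	...
	/* after waiting on the tag bitmap */
	data->ctx = blk_mq_get_ctx(data->q);
	data->hctx = blk_mq_map_queue(data->q, data->cmd_flags, data->ctx);
	tags = blk_mq_tags_from_data(data);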



thanks,
Ming
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Sagi Grimberg 1 month ago


On 21/10/2024 17:36, Ming Lei wrote:
> On Mon, Oct 21, 2024 at 02:30:01PM +0300, Sagi Grimberg wrote:
>>
>>
>> On 21/10/2024 11:31, Ming Lei wrote:
>>> On Mon, Oct 21, 2024 at 10:05:34AM +0300, Sagi Grimberg wrote:
>>>>
>>>> On 21/10/2024 4:39, Ming Lei wrote:
>>>>> On Sun, Oct 20, 2024 at 10:40:41PM +0800, zhuxiaohui wrote:
>>>>>> From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
>>>>>>
>>>>>> It is observed that nvme connect to a nvme over fabric target will
>>>>>> always fail when 'nohz_full' is set.
>>>>>>
>>>>>> In commit a46c27026da1 ("blk-mq: don't schedule block kworker on
>>>>>> isolated CPUs"), it clears hctx->cpumask for all isolate CPUs,
>>>>>> and when nvme connect to a remote target, it may fails on this stack:
>>>>>>
>>>>>>            blk_mq_alloc_request_hctx+1
>>>>>>            __nvme_submit_sync_cmd+106
>>>>>>            nvmf_connect_io_queue+181
>>>>>>            nvme_tcp_start_queue+293
>>>>>>            nvme_tcp_setup_ctrl+948
>>>>>>            nvme_tcp_create_ctrl+735
>>>>>>            nvmf_dev_write+532
>>>>>>            vfs_write+237
>>>>>>            ksys_write+107
>>>>>>            do_syscall_64+128
>>>>>>            entry_SYSCALL_64_after_hwframe+118
>>>>>>
>>>>>> due to that the given blk_mq_hw_ctx->cpumask is cleared with no available
>>>>>> blk_mq_ctx on the hw queue.
>>>>>>
>>>>>> This patch introduce a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
>>>>>> as well as a nvme_submit_flags_t 'NVME_SUBMIT_ARB_MQ' which are used to
>>>>>> indicate that block layer can fallback to a  blk_mq_ctx whose cpu
>>>>>> is not isolated.
>>>>> blk_mq_alloc_request_hctx()
>>>>> 	...
>>>>> 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
>>>>> 	...
>>>>>
>>>>> It can happen in case of non-cpu-isolation too, such as when this hctx hasn't
>>>>> online CPUs, both are same actually from this viewpoint.
>>>>>
>>>>> It is one long-time problem for nvme fc.
>>>> For what nvmf is using blk_mq_alloc_request_hctx() is not important. It just
>>>> needs a tag from that hctx. the request execution is running where
>>>> blk_mq_alloc_request_hctx() is running.
>>> I am afraid that just one tag from the specified hw queue isn't enough.
>>>
>>> The connection request needs to be issued to the hw queue & completed.
>>> Without any online CPU for this hw queue, the request can't be completed
>>> in case of managed-irq.
>> None of the consumers of this API use managed-irqs. the networking stack
>> takes care of steering irq vectors to online cpus.
> OK, it looks not necessary to AND with cpu_online_mask in
> blk_mq_alloc_request_hctx, and the behavior is actually from commit
> 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU").

That was a long time ago...

>
> But it is still too tricky as one API, please look at blk_mq_get_tag(), which may
> allocate tag from other hw queue, instead of the specified one.

I don't see how it can help here.

>
> It is just lucky for connection request because IO isn't started
> yet at that time, and the allocation always succeeds in the 1st try of
> __blk_mq_get_tag().

It's not luck; we reserve a per-queue tag for exactly this flow (connect),
so we always have one available. And when the connect is running, the
driver should guarantee that nothing else is running.
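
Pieced together from the hunks in this patch plus the fabrics code (the
reserved-tag detail is recalled from memory, so double-check), the chain is
roughly:

	nvmf_connect_io_queue()
	    __nvme_submit_sync_cmd(..., NVME_SUBMIT_RESERVED | ...)
	        blk_flags |= BLK_MQ_REQ_RESERVED;
	        blk_mq_alloc_request_hctx(q, nvme_req_op(cmd), blk_flags, qid - 1);
	            /* tag comes from the reserved pool (fabrics I/O tag sets
	             * set aside NVMF_RESERVED_TAGS), which normal I/O never
	             * touches, so the first try succeeds */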
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Ming Lei 1 month ago
On Mon, Oct 21, 2024 at 06:27:51PM +0300, Sagi Grimberg wrote:
> 
> 
> 
> On 21/10/2024 17:36, Ming Lei wrote:
> > On Mon, Oct 21, 2024 at 02:30:01PM +0300, Sagi Grimberg wrote:
> > > 
> > > 
> > > On 21/10/2024 11:31, Ming Lei wrote:
> > > > On Mon, Oct 21, 2024 at 10:05:34AM +0300, Sagi Grimberg wrote:
> > > > > 
> > > > > On 21/10/2024 4:39, Ming Lei wrote:
> > > > > > On Sun, Oct 20, 2024 at 10:40:41PM +0800, zhuxiaohui wrote:
> > > > > > > From: Zhu Xiaohui <zhuxiaohui.400@bytedance.com>
> > > > > > > 
> > > > > > > It is observed that nvme connect to a nvme over fabric target will
> > > > > > > always fail when 'nohz_full' is set.
> > > > > > > 
> > > > > > > In commit a46c27026da1 ("blk-mq: don't schedule block kworker on
> > > > > > > isolated CPUs"), it clears hctx->cpumask for all isolate CPUs,
> > > > > > > and when nvme connect to a remote target, it may fails on this stack:
> > > > > > > 
> > > > > > >            blk_mq_alloc_request_hctx+1
> > > > > > >            __nvme_submit_sync_cmd+106
> > > > > > >            nvmf_connect_io_queue+181
> > > > > > >            nvme_tcp_start_queue+293
> > > > > > >            nvme_tcp_setup_ctrl+948
> > > > > > >            nvme_tcp_create_ctrl+735
> > > > > > >            nvmf_dev_write+532
> > > > > > >            vfs_write+237
> > > > > > >            ksys_write+107
> > > > > > >            do_syscall_64+128
> > > > > > >            entry_SYSCALL_64_after_hwframe+118
> > > > > > > 
> > > > > > > due to that the given blk_mq_hw_ctx->cpumask is cleared with no available
> > > > > > > blk_mq_ctx on the hw queue.
> > > > > > > 
> > > > > > > This patch introduce a new blk_mq_req_flags_t flag 'BLK_MQ_REQ_ARB_MQ'
> > > > > > > as well as a nvme_submit_flags_t 'NVME_SUBMIT_ARB_MQ' which are used to
> > > > > > > indicate that block layer can fallback to a  blk_mq_ctx whose cpu
> > > > > > > is not isolated.
> > > > > > blk_mq_alloc_request_hctx()
> > > > > > 	...
> > > > > > 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> > > > > > 	...
> > > > > > 
> > > > > > It can happen in case of non-cpu-isolation too, such as when this hctx hasn't
> > > > > > online CPUs, both are same actually from this viewpoint.
> > > > > > 
> > > > > > It is one long-time problem for nvme fc.
> > > > > For what nvmf is using blk_mq_alloc_request_hctx() is not important. It just
> > > > > needs a tag from that hctx. the request execution is running where
> > > > > blk_mq_alloc_request_hctx() is running.
> > > > I am afraid that just one tag from the specified hw queue isn't enough.
> > > > 
> > > > The connection request needs to be issued to the hw queue & completed.
> > > > Without any online CPU for this hw queue, the request can't be completed
> > > > in case of managed-irq.
> > > None of the consumers of this API use managed-irqs. the networking stack
> > > takes care of steering irq vectors to online cpus.
> > OK, it looks not necessary to AND with cpu_online_mask in
> > blk_mq_alloc_request_hctx, and the behavior is actually from commit
> > 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU").
> 
> it is a long time ago...
> 
> > 
> > But it is still too tricky as one API, please look at blk_mq_get_tag(), which may
> > allocate tag from other hw queue, instead of the specified one.
> 
> I don't see how it can help here.

Without taking offline CPUs into account, every hctx has CPUs mapped
(except in the cpu-isolation case), so the 'cpu >= nr_cpu_ids' failure
won't be triggered.
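
A sketch of the change that statement implies (illustrative only, nobody has
posted this change in the thread):

	-	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
	+	cpu = cpumask_first(data.hctx->cpumask);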

> 
> > 
> > It is just lucky for connection request because IO isn't started
> > yet at that time, and the allocation always succeeds in the 1st try of
> > __blk_mq_get_tag().
> 
> It's not lucky, we reserve a per-queue tag for exactly this flow (connect)
> so we
> always have one available. And when the connect is running, the driver
> should
> guarantee nothing else is running.

What if there are multiple concurrent (reserved) allocation requests? You
may still end up allocating from another hw queue. In reality, nvme may not
use it that way, but as an API it is still not good, or at least the
behavior should be documented.
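
For instance, a comment along these lines on blk_mq_alloc_request_hctx()
(the wording is only a suggestion) would capture the caveat:

	/*
	 * Only use this for requests that must be issued on a specific hw
	 * queue before normal I/O has started, e.g. the fabrics connect
	 * command.  If the allocation has to wait for a tag, the request
	 * may end up on a different hw queue than the one asked for.
	 */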


thanks,
Ming
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Sagi Grimberg 1 month ago
>
>>> It is just lucky for connection request because IO isn't started
>>> yet at that time, and the allocation always succeeds in the 1st try of
>>> __blk_mq_get_tag().
>> It's not lucky, we reserve a per-queue tag for exactly this flow (connect)
>> so we
>> always have one available. And when the connect is running, the driver
>> should
>> guarantee nothing else is running.
> What if there is multiple concurrent allocation(reserve) requests?

There can't be any.

>   You still
> may run into allocation from other hw queue. In reality, nvme may don't
> use in that way, but as one API, it is still not good, or at least the
> behavior should be documented.

I agree. NVMe may have a unique need here, but it needs a tag from a
specific hctx while the context requesting it does not map to that hctx
according to the cpumap. It cannot use a tag from any other hctx.

The reason is that the connect for a queue must be done with a tag that
belongs to that queue, because nvme relies on it when it resolves the
completion back to the request.
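
Concretely (paraphrased from the nvme host code, not verbatim; 'queue_tags'
stands in for the transport's per-queue struct blk_mq_tags pointer): the
completion path maps the wire command_id back to a request through the tag
set of the queue the command was issued on, so a tag borrowed from another
hctx could not be resolved correctly:

	u16 tag = nvme_tag_from_cid(cqe->command_id);
	struct request *rq = blk_mq_tag_to_rq(queue_tags, tag);
	/* NULL or the wrong request if the tag didn't come from this queue */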
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Christoph Hellwig 1 month ago
On Tue, Oct 22, 2024 at 04:23:29PM +0300, Sagi Grimberg wrote:
> I agree. NVMe may have a unique need here, but it needs a tag from a
> specific hctx while the context requesting it does not map according to
> the hctx cpumap. It cannot use any other tag from any other hctx.
>
> The reason is that the connect for a queue must be done from a tag that
> belongs to the queue because nvme relies on it when it does resolution back 
> to
> the request to the completion.

I wonder if we should bite the bullet and not use a request for the
connect commands. We've already special-cased the AEN command because
it was causing too many problems, and given all the pain so far, connect
might also have hit that threshold.
Re: [PATCH v1] blk-mq: add one blk_mq_req_flags_t type to support mq ctx fallback
Posted by Sagi Grimberg 1 month ago


On 23/10/2024 8:19, Christoph Hellwig wrote:
> On Tue, Oct 22, 2024 at 04:23:29PM +0300, Sagi Grimberg wrote:
>> I agree. NVMe may have a unique need here, but it needs a tag from a
>> specific hctx while the context requesting it does not map according to
>> the hctx cpumap. It cannot use any other tag from any other hctx.
>>
>> The reason is that the connect for a queue must be done from a tag that
>> belongs to the queue because nvme relies on it when it does resolution back
>> to
>> the request to the completion.
> I wonder if we should byte the bullet and not use a request for the
> connect commands.  We've already special cased the AEN command because
> it was causing too many problems, and given all the pain so far connect
> might also have hit that treshold.

That would be trading one set of problems for another. We had a hard
time in the past correctly fencing AERs against controller states,
because an AER is not a normal block request.

Plus, unlike AERs, connect has a timeout, which we'll need to take care
of...

Not that it's impossible; I'd just like to avoid this option.