The port entry binding might not exist and thus the req->port
pointer is not valid.
Reproducer: nvme/005 with active system nvmf-autoconnect systemd service.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/nvme/target/discovery.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c
index 668d257fa986..fc113057cb95 100644
--- a/drivers/nvme/target/discovery.c
+++ b/drivers/nvme/target/discovery.c
@@ -191,6 +191,15 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req)
 		goto out;
 	}
 
+
+	/* No port assigned, portentrybinding is missing */
+	if (!req->port) {
+		req->error_loc =
+			offsetof(struct nvme_get_log_page_command, lpo);
+		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		goto out;
+	}
+
 	/*
 	 * Make sure we're passing at least a buffer of response header size.
 	 * If host provided data len is less than the header size, only the
--
2.41.0
On 8/29/23 11:13, Daniel Wagner wrote:
> The port entry binding might not exist and thus the req->port
> pointer is not valid.
>
> Reproducer: nvme/005 with active system nvmf-autoconnect systemd service.
>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
> drivers/nvme/target/discovery.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c
> index 668d257fa986..fc113057cb95 100644
> --- a/drivers/nvme/target/discovery.c
> +++ b/drivers/nvme/target/discovery.c
> @@ -191,6 +191,15 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req)
>  		goto out;
>  	}
>  
> +
> +	/* No port assigned, portentrybinding is missing */

Missing space.

> +	if (!req->port) {
> +		req->error_loc =
> +			offsetof(struct nvme_get_log_page_command, lpo);
> +		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Make sure we're passing at least a buffer of response header size.
>  	 * If host provided data len is less than the header size, only the
Otherwise looks good.
Cheers,
Hannes
On Tue, Aug 29, 2023 at 11:13:49AM +0200, Daniel Wagner wrote:
> The port entry binding might not exist and thus the req->port
> pointer is not valid.
>
> Reproducer: nvme/005 with active system nvmf-autoconnect systemd service.
>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>  drivers/nvme/target/discovery.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c
> index 668d257fa986..fc113057cb95 100644
> --- a/drivers/nvme/target/discovery.c
> +++ b/drivers/nvme/target/discovery.c
> @@ -191,6 +191,15 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req)
>  		goto out;
>  	}
>  
> +
> +	/* No port assigned, portentrybinding is missing */

Double new line above, and I think a missing white space before
binding.

But I'm still confused how we can get here without req->port set. Can
you try to do a little more analysis as I suspect we have a deeper
problem somewhere.
On Tue, Sep 05, 2023 at 08:50:32AM +0200, Christoph Hellwig wrote:
> > +	/* No port assigned, portentrybinding is missing */
>
> Double new line above, and I think a missing white space before
> binding.

Yep, sorry.

> But I'm still confused how we can get here without req->port
> set. Can you try to do a little more analysis as I suspect we have
> a deeper problem somewhere.

I am only able to reproduce this if there are two connect/discovery
attempts happening at the same time. I'll collect some logs and attempt
to make some sense out of it.
On Tue, Sep 05, 2023 at 12:40:25PM +0200, Daniel Wagner wrote:
> > But I'm still confused how we can get here without req->port
> > set. Can you try to do a little more analysis as I suspect we have
> > a deeper problem somewhere.
The problem is that nvme/005 starts to clean up all resources and there
is a race between the cleanup path and the host trying to figure out
what's going on (get log page).

We have 3 associations:

  assoc 0: systemd/udev triggered 'nvme connect-all' discovery controller
  assoc 1: discovery controller from nvme/005
  assoc 2: i/o controller from nvme/005

nvme/005 issues a reset_controller but doesn't wait for or check the
result. Instead we go directly into the resource cleanup part, nvme
disconnect, which removes assoc 1 and assoc 2. Then the target cleanup
part starts. At this point, assoc 0 is still around.
nvme nvme3: Removing ctrl: NQN "blktests-subsystem-1"
block nvme3n1: no available path - failing I/O
block nvme3n1: no available path - failing I/O
Buffer I/O error on dev nvme3n1, logical block 89584, async page read
(NULL device *): {0:2} Association deleted
nvmet_fc: nvmet_fc_portentry_unbind: tgtport 000000004f5c9138 pe 00000000e2a2da84
[321] nvmet: ctrl 2 stop keep-alive
(NULL device *): {0:1} Association freed
(NULL device *): Disconnect LS failed: No Association
general protection fault, probably for non-canonical address 0xdffffc00000000a4: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000520-0x0000000000000527]
CPU: 1 PID: 250 Comm: kworker/1:4 Tainted: G W 6.5.0-rc2+ #20 e82c2becb08b573f1fa41dfeddc70ac8f6838a63
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
Workqueue: nvmet-wq fcloop_fcp_recv_work [nvme_fcloop]
RIP: 0010:nvmet_execute_disc_get_log_page+0x23f/0x8c0 [nvmet]
The target cleanup removes the port from the subsystem
(nvmet_fc_portentry_unbind) and does not check if there is still an
association around. Right after we have removed assoc 1 and 2, the host
sends a get log page command on assoc 0. By then we have already removed
the port binding, and thus the pointer is invalid when
nvmet_execute_disc_get_log_page gets executed.
I am still pondering how to fix this.
On Mon, Sep 11, 2023 at 04:44:47PM +0200, Daniel Wagner wrote:
> On Tue, Sep 05, 2023 at 12:40:25PM +0200, Daniel Wagner wrote:
> > > But I'm still confused how we can get here without req->port
> > > set. Can you try to do a little more analysis as I suspect we have
> > > a deeper problem somewhere.
>
> The problem is that nvme/005 starts to clean up all resources and there
> is a race between the cleanup path and the host trying to figure out
> what's going on (get log page).
>
> We have 3 associations:
>
>   assoc 0: systemd/udev triggered 'nvme connect-all' discovery controller
>   assoc 1: discovery controller from nvme/005
>   assoc 2: i/o controller from nvme/005

Actually, assoc 1 is also an i/o controller for the same hostnqn as
assoc 2. This looks more like assoc 0 and 1 are both from the
systemd/udev trigger. But why the heck is the hostnqn for assoc 1 the
same as the hostnqn we use in blktests? Something is really off here.

The rest of my analysis is still valid.
> > We have 3 associations:
> >
> >   assoc 0: systemd/udev triggered 'nvme connect-all' discovery controller
> >   assoc 1: discovery controller from nvme/005
> >   assoc 2: i/o controller from nvme/005
>
> Actually, assoc 1 is also an i/o controller for the same hostnqn as
> assoc 2. This looks more like assoc 0 and 1 are both from the
> systemd/udev trigger. But why the heck is the hostnqn for assoc 1 the
> same as the hostnqn we use in blktests? Something is really off here.
>
> The rest of my analysis is still valid.

I figured it out. I still used an older version of nvme-cli which didn't
apply the hostnqn correctly. We fixed this in the meantime. With the
latest git version of nvme-cli the second I/O controller is not set up
anymore and the crash is gone. Though we still don't clean up all
resources, as the kernel module complains with

  [41707.083039] nvmet_fc: nvmet_fc_exit_module: targetport list not empty
So that's interesting. But what I'm mostly worried about is how the
nvmet kernel code allows a request without ->port to progress to the
actual command handler. We should never allow a command to get that far
if ->port is NULL, and should not allow ->port to be cleared while
commands are still being handled.
On Wed, Sep 13, 2023 at 01:35:19PM +0200, Christoph Hellwig wrote:
> So that's interesting. But what I'm mostly worried about is how the
> nvmet kernel code allows a request without ->port to progress to the
> actual command handler.

nvmet_fc_handle_fcp_rqst()

	if (tgtport->pe)
		fod->req.port = tgtport->pe->port;

Not sure why this is there. Will test what happens when we just return
an error when we don't have pe set.

> We should never allow a command to get that far if ->port is NULL, and
> should not allow ->port to be cleared while commands are still being
> handled.

Okay, makes sense. I'll test this when I have access to my rig again
tomorrow.