From nobody Mon Nov 25 05:21:33 2024
From: Daniel Wagner
Date: Tue, 29 Oct 2024 19:40:11 +0100
Subject: [PATCH v2 1/2] nvme-fc: go straight to connecting state when initializing
Message-Id: <20241029-nvme-fc-handle-com-lost-v2-1-5b0d137e2a0a@kernel.org>
References: <20241029-nvme-fc-handle-com-lost-v2-0-5b0d137e2a0a@kernel.org>
In-Reply-To: <20241029-nvme-fc-handle-com-lost-v2-0-5b0d137e2a0a@kernel.org>
To: James Smart, Keith Busch, Christoph Hellwig, Sagi Grimberg, Hannes Reinecke, Paul Ely
Cc: linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, Daniel Wagner
X-Mailer: b4 0.14.2

The initial controller initialization mimics the reconnect loop
behavior by switching from NEW to RESETTING and then to CONNECTING.
Because the transition from NEW to CONNECTING is itself a valid
transition, there is no point in entering the RESETTING state first.
TCP and RDMA also transition directly to the CONNECTING state.
Reviewed-by: Sagi Grimberg
Reviewed-by: Hannes Reinecke
Signed-off-by: Daniel Wagner
Reviewed-by: Christoph Hellwig
---
 drivers/nvme/host/fc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b81af7919e94c421387033bf8361a9cf8a867486..d45ab530ff9b7bd03bc311474278fc840f8786d5 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3579,8 +3579,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	list_add_tail(&ctrl->ctrl_list, &rport->ctrl_list);
 	spin_unlock_irqrestore(&rport->lock, flags);
 
-	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) ||
-	    !nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
 		dev_err(ctrl->ctrl.device,
 			"NVME-FC{%d}: failed to init ctrl state\n", ctrl->cnum);
 		goto fail_ctrl;
-- 
2.47.0

From nobody Mon Nov 25 05:21:33 2024
From: Daniel Wagner
Date: Tue, 29 Oct 2024 19:40:12 +0100
Subject: [PATCH v2 2/2] nvme: handle connectivity loss in nvme_set_queue_count
Message-Id: <20241029-nvme-fc-handle-com-lost-v2-2-5b0d137e2a0a@kernel.org>
References: <20241029-nvme-fc-handle-com-lost-v2-0-5b0d137e2a0a@kernel.org>
In-Reply-To: <20241029-nvme-fc-handle-com-lost-v2-0-5b0d137e2a0a@kernel.org>
To: James Smart, Keith Busch, Christoph Hellwig, Sagi Grimberg, Hannes Reinecke, Paul Ely
Cc: linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, Daniel Wagner
X-Mailer: b4 0.14.2

nvme_set_queue_count() is called when setting up the IO queues.
When updating the queue count fails, the function ignores all NVMe
errors. The assumption is that the controller is in a degraded state:
the admin queue is up and running, but the IO queues are not. In this
state it is still possible to issue admin commands to the controller
to mitigate the problem; that is why the controller is allowed to
enter the LIVE state. But by filtering out all errors, it also filters
out a connectivity loss event for fabric controllers:

 1) nvme nvme10: NVME-FC{10}: create association : ...
 2) nvme nvme10: NVME-FC{10}: controller connectivity lost. Awaiting Reconnect
    nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
 3) nvme nvme10: Could not set queue count (880)
    nvme nvme10: Failed to configure AEN (cfg 900)
 4) nvme nvme10: NVME-FC{10}: controller connect complete
 5) nvme nvme10: failed nvme_keep_alive_end_io error=4

A new connection attempt is started at 1), and while connecting the
host receives a connectivity loss event at 2). 3) is the point where
the connect code observes a problem but ignores it, and it enters the
LIVE state at 4). The keep-alive command eventually times out at 5),
but again this type of error is ignored. Note that the status in
nvme_keep_alive_end_io is EINTR; the real reason (connectivity loss)
has been lost at this point.

Thus, catch the error early, where we still know the exact reason why
nvme_set_features() failed, and bail out from there.
Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
Signed-off-by: Daniel Wagner
Reviewed-by: Christoph Hellwig
---
 drivers/nvme/host/core.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d09dbe71b2f1ac473ae687c4dc687..cc5ed6daf61f6cbc6fdf7b48687e25225bfd9f17 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1664,7 +1664,12 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 
 	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
 			&result);
-	if (status < 0)
+	/*
+	 * It's either a kernel error or the host observed a connectivity
+	 * loss. In either case it is not possible to communicate with the
+	 * controller, and thus we enter the error code path.
+	 */
+	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
 		return status;
 
 	/*
-- 
2.47.0