From: Mohamed Khalfella
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Jens Axboe, Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
	Mohamed Khalfella
Subject: [PATCH v4 11/15] nvme-rdma: Use CCR to recover controller that hits an error
Date: Fri, 27 Mar 2026 17:43:42 -0700
Message-ID: <20260328004518.1729186-12-mkhalfella@purestorage.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260328004518.1729186-1-mkhalfella@purestorage.com>
References: <20260328004518.1729186-1-mkhalfella@purestorage.com>
X-Mailing-List: linux-kernel@vger.kernel.org

A live NVMe controller that hits an error now moves to the FENCING state
instead of the RESETTING state. ctrl->fencing_work attempts CCR to
terminate inflight I/Os. Regardless of whether the CCR operation succeeds
or fails, the controller is then moved through FENCED to the RESETTING
state so that the error recovery process continues.

Signed-off-by: Mohamed Khalfella
---
 drivers/nvme/host/rdma.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 57111139e84f..b42798781619 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -106,6 +106,7 @@ struct nvme_rdma_ctrl {
 
 	/* other member variables */
 	struct blk_mq_tag_set	tag_set;
+	struct work_struct	fencing_work;
 	struct work_struct	err_work;
 
 	struct nvme_rdma_qe	async_event_sqe;
@@ -1120,11 +1121,28 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	nvme_rdma_reconnect_or_remove(ctrl, ret);
 }
 
+static void nvme_rdma_fencing_work(struct work_struct *work)
+{
+	struct nvme_rdma_ctrl *rdma_ctrl = container_of(work,
+			struct nvme_rdma_ctrl, fencing_work);
+	struct nvme_ctrl *ctrl = &rdma_ctrl->ctrl;
+	int ret;
+
+	ret = nvme_fence_ctrl(ctrl);
+	if (ret)
+		dev_info(ctrl->device, "CCR failed with error %d\n", ret);
+
+	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		queue_work(nvme_reset_wq, &rdma_ctrl->err_work);
+}
+
 static void nvme_rdma_error_recovery_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
 			struct nvme_rdma_ctrl, err_work);
 
+	flush_work(&ctrl->fencing_work);
 	nvme_stop_keep_alive(&ctrl->ctrl);
 	flush_work(&ctrl->ctrl.async_event_work);
 	nvme_rdma_teardown_io_queues(ctrl, false);
@@ -1147,6 +1165,12 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 {
+	if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
+		dev_warn(ctrl->ctrl.device, "starting controller fencing\n");
+		queue_work(nvme_wq, &ctrl->fencing_work);
+		return;
+	}
+
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;
 
@@ -1957,13 +1981,15 @@ static enum blk_eh_timer_return nvme_rdma_timeout(struct request *rq)
 	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 	struct nvme_command *cmd = req->req.cmd;
 	int qid = nvme_rdma_queue_idx(queue);
+	enum nvme_ctrl_state state;
 
 	dev_warn(ctrl->ctrl.device,
 		 "I/O tag %d (%04x) opcode %#x (%s) QID %d timeout\n",
 		 rq->tag, nvme_cid(rq), cmd->common.opcode,
 		 nvme_fabrics_opcode_str(qid, cmd), qid);
 
-	if (nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_LIVE) {
+	state = nvme_ctrl_state(&ctrl->ctrl);
+	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
 		/*
 		 * If we are resetting, connecting or deleting we should
 		 * complete immediately because we may block controller
@@ -2169,6 +2195,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 		container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);
 	int ret;
 
+	flush_work(&ctrl->fencing_work);
 	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_rdma_shutdown_ctrl(ctrl, false);
 
@@ -2281,6 +2308,7 @@ static struct nvme_rdma_ctrl *nvme_rdma_alloc_ctrl(struct device *dev,
 
 	INIT_DELAYED_WORK(&ctrl->reconnect_work,
 			nvme_rdma_reconnect_ctrl_work);
+	INIT_WORK(&ctrl->fencing_work, nvme_rdma_fencing_work);
 	INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
 
-- 
2.52.0
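
For readers following the state flow, here is a small userspace sketch of
what the patch appears to do on an error: enter FENCING, attempt CCR, then
move through FENCED to RESETTING whatever the CCR result was. It is not
kernel code. Every model_* and MODEL_* name is hypothetical, and the
transition rules in model_change_state() are an assumption, since the real
state table lives in the NVMe core and is not part of this patch.

```c
/* Userspace model only; all names below are hypothetical stand-ins. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum model_state {
	MODEL_CONNECTING,
	MODEL_LIVE,
	MODEL_FENCING,
	MODEL_FENCED,
	MODEL_RESETTING,
};

enum model_queued { QUEUED_NONE, QUEUED_FENCING_WORK, QUEUED_ERR_WORK };

struct model_ctrl {
	enum model_state state;
};

/*
 * Assumed transition rules: LIVE -> FENCING -> FENCED -> RESETTING,
 * plus the pre-existing LIVE -> RESETTING. Everything else is rejected.
 */
static bool model_change_state(struct model_ctrl *c, enum model_state next)
{
	bool ok = false;

	switch (next) {
	case MODEL_FENCING:
		ok = c->state == MODEL_LIVE;
		break;
	case MODEL_FENCED:
		ok = c->state == MODEL_FENCING;
		break;
	case MODEL_RESETTING:
		ok = c->state == MODEL_LIVE || c->state == MODEL_FENCED;
		break;
	default:
		break;
	}
	if (ok)
		c->state = next;
	return ok;
}

/* Shape of nvme_rdma_error_recovery(): try FENCING first, then RESETTING. */
static enum model_queued model_error_recovery(struct model_ctrl *c)
{
	if (model_change_state(c, MODEL_FENCING))
		return QUEUED_FENCING_WORK;
	if (!model_change_state(c, MODEL_RESETTING))
		return QUEUED_NONE;
	return QUEUED_ERR_WORK;
}

/*
 * Shape of nvme_rdma_fencing_work(). ccr_result stands in for the return
 * value of nvme_fence_ctrl(); a nonzero value is logged but does not stop
 * recovery. Returns true when err_work would be queued.
 */
static bool model_fencing_work(struct model_ctrl *c, int ccr_result)
{
	/* Model precondition: the work is only queued after entering FENCING. */
	assert(c->state == MODEL_FENCING);
	if (ccr_result)
		printf("CCR failed with error %d\n", ccr_result);

	model_change_state(c, MODEL_FENCED);
	return model_change_state(c, MODEL_RESETTING);
}

/* Condition under which the patched timeout handler completes immediately. */
static bool model_timeout_completes_early(enum model_state s)
{
	return s != MODEL_LIVE && s != MODEL_FENCING;
}
```

Starting from MODEL_LIVE, model_error_recovery() returns
QUEUED_FENCING_WORK, and a following model_fencing_work() call leaves the
state at MODEL_RESETTING whether ccr_result is zero or not. A controller in
MODEL_FENCING is treated like a live one by the timeout check, which matches
the extra condition added to nvme_rdma_timeout().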