From: Mohamed Khalfella
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Jens Axboe, Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani,
	linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
	Mohamed Khalfella
Subject: [PATCH v4 08/15] nvme: Implement cross-controller reset recovery
Date: Fri, 27 Mar 2026 17:43:39 -0700
Message-ID: <20260328004518.1729186-9-mkhalfella@purestorage.com>
In-Reply-To: <20260328004518.1729186-1-mkhalfella@purestorage.com>
References: <20260328004518.1729186-1-mkhalfella@purestorage.com>

A host with more than one path to an NVMe subsystem typically has one
NVMe controller associated with each path. This mostly applies to
NVMe-oF. If one path goes down, inflight IOs on that path should not be
retried immediately on another path, because doing so could lead to data
corruption as described in TP4129. TP8028 defines a cross-controller
reset (CCR) mechanism that the host can use to terminate IOs on the
failed path through one of the remaining healthy paths. Inflight IOs
should be retried on another path only after they have been terminated,
or after enough time has passed as defined by TP4129.

Implement the core cross-controller reset logic shared by the
transports.

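The source-controller retry policy of nvme_fence_ctrl() can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions, not the kernel code: struct model_ctrl,
model_find_source() and model_fence() are hypothetical names, the CCR
outcome of each source controller is a canned value, and the deadline is
modeled as a budget of attempts instead of jiffies. The ccr_limit
accounting, locking and reference counting are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one candidate source controller. */
struct model_ctrl {
	unsigned int cntlid;	/* controller ID */
	bool live;		/* models ctrl->state == NVME_CTRL_LIVE */
	int ccr_result;		/* canned outcome: 0, -EIO or another -errno */
};

/* Return the first live controller with cntlid >= min_cntlid, or NULL. */
const struct model_ctrl *model_find_source(const struct model_ctrl *ctrls,
					   size_t n, unsigned int min_cntlid)
{
	for (size_t i = 0; i < n; i++) {
		if (ctrls[i].cntlid < min_cntlid || !ctrls[i].live)
			continue;
		return &ctrls[i];
	}
	return NULL;
}

/*
 * Model of the retry loop. Each attempt consumes one unit of budget,
 * which stands in for the time-based deadline.
 */
int model_fence(const struct model_ctrl *ctrls, size_t n, unsigned int budget)
{
	unsigned int min_cntlid = 0;

	assert(ctrls != NULL || n == 0);

	while (budget-- > 0) {
		const struct model_ctrl *src;

		src = model_find_source(ctrls, n, min_cntlid);
		if (!src)
			return -EIO;	/* no usable source controller */

		if (src->ccr_result == 0)
			return 0;	/* impacted controller was fenced */

		/* Never pick the same source controller twice. */
		min_cntlid = src->cntlid + 1;

		if (src->ccr_result == -EIO)
			continue;	/* command failed, try the next one */

		return src->ccr_result;	/* operation failed or timed out */
	}

	return -ETIMEDOUT;		/* budget (deadline) exhausted */
}
```

In this model only a command-level failure (-EIO) moves on to the next
source controller. Any other failure ends the attempt, and the caller is
then expected to fall back to waiting out the time defined by TP4129.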
Signed-off-by: Mohamed Khalfella
---
 drivers/nvme/host/constants.c |   1 +
 drivers/nvme/host/core.c      | 145 ++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |   9 +++
 3 files changed, 155 insertions(+)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index dc90df9e13a2..f679efd5110e 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
 	[nvme_admin_virtual_mgmt] = "Virtual Management",
 	[nvme_admin_nvme_mi_send] = "NVMe Send MI",
 	[nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
+	[nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
 	[nvme_admin_dbbuf] = "Doorbell Buffer Config",
 	[nvme_admin_format_nvm] = "Format NVM",
 	[nvme_admin_security_send] = "Security Send",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 824a1193bec8..5603ae36444f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -554,6 +554,150 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
 
+static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
+					    u32 min_cntlid)
+{
+	struct nvme_subsystem *subsys = ictrl->subsys;
+	struct nvme_ctrl *ctrl, *sctrl = NULL;
+	unsigned long flags;
+
+	mutex_lock(&nvme_subsystems_lock);
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		if (ctrl->cntlid < min_cntlid)
+			continue;
+
+		if (atomic_dec_if_positive(&ctrl->ccr_limit) < 0)
+			continue;
+
+		spin_lock_irqsave(&ctrl->lock, flags);
+		if (ctrl->state != NVME_CTRL_LIVE) {
+			spin_unlock_irqrestore(&ctrl->lock, flags);
+			atomic_inc(&ctrl->ccr_limit);
+			continue;
+		}
+
+		/*
+		 * We got a good candidate source controller that is locked and
+		 * LIVE. However, no guarantee ctrl will not be deleted after
+		 * ctrl->lock is released. Get a ref of both ctrl and admin_q
+		 * so they do not disappear until we are done with them.
+		 */
+		WARN_ON_ONCE(!blk_get_queue(ctrl->admin_q));
+		nvme_get_ctrl(ctrl);
+		spin_unlock_irqrestore(&ctrl->lock, flags);
+		sctrl = ctrl;
+		break;
+	}
+	mutex_unlock(&nvme_subsystems_lock);
+	return sctrl;
+}
+
+static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
+{
+	atomic_inc(&sctrl->ccr_limit);
+	blk_put_queue(sctrl->admin_q);
+	nvme_put_ctrl(sctrl);
+}
+
+static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl,
+			       unsigned long deadline)
+{
+	struct nvme_ccr_entry ccr = { };
+	union nvme_result res = { 0 };
+	struct nvme_command c = { };
+	unsigned long flags, now, tmo = 0;
+	bool completed = false;
+	int ret = 0;
+	u32 result;
+
+	init_completion(&ccr.complete);
+	ccr.ictrl = ictrl;
+
+	spin_lock_irqsave(&sctrl->lock, flags);
+	list_add_tail(&ccr.list, &sctrl->ccr_list);
+	spin_unlock_irqrestore(&sctrl->lock, flags);
+
+	c.ccr.opcode = nvme_admin_cross_ctrl_reset;
+	c.ccr.ciu = ictrl->ciu;
+	c.ccr.icid = cpu_to_le16(ictrl->cntlid);
+	c.ccr.cirn = cpu_to_le64(ictrl->cirn);
+	ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
+				     NULL, 0, NVME_QID_ANY, 0);
+	if (ret) {
+		ret = -EIO;
+		goto out;
+	}
+
+	result = le32_to_cpu(res.u32);
+	if (result & 0x01) /* Immediate Reset Successful */
+		goto out;
+
+	now = jiffies;
+	if (time_before(now, deadline))
+		tmo = min_t(unsigned long,
+			    secs_to_jiffies(ictrl->kato), deadline - now);
+
+	if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
+		ret = -ETIMEDOUT;
+		goto out;
+	}
+
+	completed = true;
+
+out:
+	spin_lock_irqsave(&sctrl->lock, flags);
+	list_del(&ccr.list);
+	spin_unlock_irqrestore(&sctrl->lock, flags);
+	if (completed) {
+		if (ccr.ccrs == NVME_CCR_STATUS_SUCCESS)
+			return 0;
+		return -EREMOTEIO;
+	}
+	return ret;
+}
+
+int nvme_fence_ctrl(struct nvme_ctrl *ictrl)
+{
+	unsigned long deadline, timeout;
+	struct nvme_ctrl *sctrl;
+	u32 min_cntlid = 0;
+	int ret;
+
+	timeout = nvme_fence_timeout_ms(ictrl);
+	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
+
+	deadline = jiffies + msecs_to_jiffies(timeout);
+	while (time_is_after_jiffies(deadline)) {
+		sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
+		if (!sctrl) {
+			dev_dbg(ictrl->device,
+				"failed to find source controller\n");
+			return -EIO;
+		}
+
+		ret = nvme_issue_wait_ccr(sctrl, ictrl, deadline);
+		if (!ret) {
+			dev_info(ictrl->device, "CCR succeeded using %s\n",
+				 dev_name(sctrl->device));
+			nvme_put_ctrl_ccr(sctrl);
+			return 0;
+		}
+
+		min_cntlid = sctrl->cntlid + 1;
+		nvme_put_ctrl_ccr(sctrl);
+
+		if (ret == -EIO) /* CCR command failed */
+			continue;
+
+		/* CCR operation failed or timed out */
+		return ret;
+	}
+
+	dev_info(ictrl->device, "CCR operation timeout\n");
+	return -ETIMEDOUT;
+}
+EXPORT_SYMBOL_GPL(nvme_fence_ctrl);
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
@@ -5116,6 +5260,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
+	INIT_LIST_HEAD(&ctrl->ccr_list);
 	xa_init(&ctrl->cels);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 45e58434cf30..f2bcff9ccd25 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -333,6 +333,13 @@ enum nvme_ctrl_flags {
 	NVME_CTRL_FROZEN = 6,
 };
 
+struct nvme_ccr_entry {
+	struct list_head list;
+	struct completion complete;
+	struct nvme_ctrl *ictrl;
+	u8 ccrs;
+};
+
 struct nvme_ctrl {
 	bool comp_seen;
 	bool identified;
@@ -350,6 +357,7 @@ struct nvme_ctrl {
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
+	struct list_head ccr_list;
 	struct mutex namespaces_lock;
 	struct srcu_struct srcu;
 	struct device ctrl_device;
@@ -868,6 +876,7 @@ blk_status_t nvme_host_path_error(struct request *req);
 bool nvme_cancel_request(struct request *req, void *data);
 void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
 void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
+int nvme_fence_ctrl(struct nvme_ctrl *ctrl);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
-- 
2.52.0