From: Surabhi Gogte
To: kbusch@kernel.org, axboe@kernel.dk, hch@lst.de, sagi@grimberg.me
Cc: linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
	mkhalfella@purestorage.com, randyj@purestorage.com, Surabhi Gogte
Subject: [PATCH] nvme-rdma: parallelize I/O queue allocation and startup
Date: Thu, 28 May 2026 18:13:54 -0600
Message-ID: <20260529001354.1003640-1-sgogte@purestorage.com>
X-Mailer: git-send-email 2.54.0

Refactor nvme rdma I/O queue setup to use parallel work queues,
combining allocation and startup into a single parallel operation per
queue. This reduces connection and reconnection setup time when there
are delays in establishing connections, which is especially important
for high-core-count hosts.

Key changes:
- Use a dedicated nvme_setup_wq for per-queue setup work instead of
  nvme_wq or nvme_reset_wq.
  Since reconnect and reset paths run as workers on those queues,
  calling flush_work() on setup items queued to the same workqueue
  would trigger lockdep's recursive-flush detection and risk deadlock
  under worker-pool exhaustion.
- Add setup_work and queue_idx fields to nvme_rdma_queue to enable
  per-queue workqueue dispatch; add a qsetup_err atomic to
  nvme_rdma_ctrl for collecting errors across parallel workers.
- Refactor nvme_rdma_alloc_queue() to accept a pre-initialized queue
  pointer instead of (ctrl, idx, queue_size), updating all call sites
  including the admin queue path.
- Remove nvme_rdma_alloc_io_queues() and nvme_rdma_start_io_queues();
  their logic is folded into nvme_rdma_setup_io_queues() and
  nvme_rdma_configure_io_queues().
- Move queue count negotiation (nvme_set_queue_count(),
  nvmf_set_io_queues()) from the removed nvme_rdma_alloc_io_queues()
  into nvme_rdma_configure_io_queues().
- Introduce nvme_rdma_setup_io_queues() to dispatch alloc+start per
  queue in parallel and record the first error atomically.

Testing on a 64-core host with 64 I/O queues shows nvme-rdma connection
time reduced from ~1.4s to 416ms.
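
For readers who want to try the error-collection pattern outside the
kernel, the following is a minimal user-space sketch, not the driver
code. It assumes POSIX threads in place of workqueue items
(pthread_create()/pthread_join() standing in for queue_work() and
flush_work()) and C11 atomics in place of atomic_cmpxchg(). All demo_*
names and the fake_result field are hypothetical and exist only for
illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for nvme_rdma_ctrl: only the shared error slot. */
struct demo_ctrl {
	atomic_int qsetup_err;
};

/* Hypothetical stand-in for nvme_rdma_queue. */
struct demo_queue {
	struct demo_ctrl *ctrl;
	int queue_idx;
	int fake_result;	/* pretend outcome of alloc+start: 0 or -errno */
	pthread_t thread;
};

/* Store err in the shared slot only if no error has been stored yet. */
static void demo_record_err(struct demo_ctrl *ctrl, int err)
{
	int expected = 0;

	atomic_compare_exchange_strong(&ctrl->qsetup_err, &expected, err);
}

/* Per-queue worker: "set up" the queue and report a failure, if any. */
static void *demo_setup_queue_work(void *arg)
{
	struct demo_queue *queue = arg;

	if (queue->fake_result)
		demo_record_err(queue->ctrl, queue->fake_result);
	return NULL;
}

/* Start one worker per queue, wait for all, return 0 or a recorded error. */
static int demo_setup_io_queues(struct demo_ctrl *ctrl,
				struct demo_queue *queues, int nr_queues,
				const int *fake_results)
{
	int i, started = 0;

	atomic_store(&ctrl->qsetup_err, 0);

	for (i = 0; i < nr_queues; i++) {
		queues[i].ctrl = ctrl;
		queues[i].queue_idx = i + 1;
		queues[i].fake_result = fake_results[i];
		if (pthread_create(&queues[i].thread, NULL,
				   demo_setup_queue_work, &queues[i])) {
			demo_record_err(ctrl, -EAGAIN);
			break;
		}
		started++;
	}

	/* Join every started worker before looking at the error slot. */
	for (i = 0; i < started; i++)
		pthread_join(queues[i].thread, NULL);

	return atomic_load(&ctrl->qsetup_err);
}
```

Calling demo_setup_io_queues() with an all-zero fake_results array
returns 0; with a single non-zero entry it returns that entry. When
several workers fail, whichever reaches the compare-and-exchange first
wins, so the caller sees one of the errors, not necessarily the one
from the lowest-numbered queue.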

Signed-off-by: Surabhi Gogte
---
 drivers/nvme/host/core.c |  15 ++++-
 drivers/nvme/host/nvme.h |   1 +
 drivers/nvme/host/rdma.c | 127 ++++++++++++++++++++++-----------------
 3 files changed, 86 insertions(+), 57 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dc388e24caad..f1247a61c797 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -107,7 +107,8 @@ MODULE_PARM_DESC(disable_pi_offsets,
 		 "disable protection information if it has an offset");
 
 /*
- * nvme_wq - hosts nvme related works that are not reset or delete
+ * nvme_wq - hosts nvme related works that are not queue setup, reset or delete
+ * nvme_setup_wq - hosts nvme queue setup works
  * nvme_reset_wq - hosts nvme reset works
  * nvme_delete_wq - hosts nvme delete works
  *
@@ -126,6 +127,9 @@ EXPORT_SYMBOL_GPL(nvme_reset_wq);
 struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
+struct workqueue_struct *nvme_setup_wq;
+EXPORT_SYMBOL_GPL(nvme_setup_wq);
+
 static LIST_HEAD(nvme_subsystems);
 DEFINE_MUTEX(nvme_subsystems_lock);
 
@@ -5415,10 +5419,14 @@ static int __init nvme_core_init(void)
 	if (!nvme_delete_wq)
 		goto destroy_reset_wq;
 
+	nvme_setup_wq = alloc_workqueue("nvme-setup-wq", wq_flags, 0);
+	if (!nvme_setup_wq)
+		goto destroy_delete_wq;
+
 	result = alloc_chrdev_region(&nvme_ctrl_base_chr_devt, 0,
 			NVME_MINORS, "nvme");
 	if (result < 0)
-		goto destroy_delete_wq;
+		goto destroy_setup_wq;
 
 	result = class_register(&nvme_class);
 	if (result)
@@ -5452,6 +5460,8 @@ static int __init nvme_core_init(void)
 	class_unregister(&nvme_class);
 unregister_chrdev:
 	unregister_chrdev_region(nvme_ctrl_base_chr_devt, NVME_MINORS);
+destroy_setup_wq:
+	destroy_workqueue(nvme_setup_wq);
 destroy_delete_wq:
 	destroy_workqueue(nvme_delete_wq);
 destroy_reset_wq:
@@ -5470,6 +5480,7 @@ static void __exit nvme_core_exit(void)
 	class_unregister(&nvme_class);
 	unregister_chrdev_region(nvme_ns_chr_devt, NVME_MINORS);
 	unregister_chrdev_region(nvme_ctrl_base_chr_devt, NVME_MINORS);
+	destroy_workqueue(nvme_setup_wq);
 	destroy_workqueue(nvme_delete_wq);
 	destroy_workqueue(nvme_reset_wq);
 	destroy_workqueue(nvme_wq);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ccd5e05dac98..9cc49f5ad8af 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -49,6 +49,7 @@ extern unsigned int admin_timeout;
 extern struct workqueue_struct *nvme_wq;
 extern struct workqueue_struct *nvme_reset_wq;
 extern struct workqueue_struct *nvme_delete_wq;
+extern struct workqueue_struct *nvme_setup_wq;
 extern struct mutex nvme_subsystems_lock;
 
 /*
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index f77c960f7632..b2db43bd6e84 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -98,6 +98,8 @@ struct nvme_rdma_queue {
 	bool			pi_support;
 	int			cq_size;
 	struct mutex		queue_lock;
+	struct work_struct	setup_work;
+	int			queue_idx;
 };
 
 struct nvme_rdma_ctrl {
@@ -125,6 +127,7 @@ struct nvme_rdma_ctrl {
 	struct nvme_ctrl	ctrl;
 	bool			use_inline_data;
 	u32			io_queues[HCTX_MAX_TYPES];
+	atomic_t		qsetup_err;
 };
 
 static inline struct nvme_rdma_ctrl *to_rdma_ctrl(struct nvme_ctrl *ctrl)
@@ -566,16 +569,14 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue)
 	return ret;
 }
 
-static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
-		int idx, size_t queue_size)
+static int nvme_rdma_alloc_queue(struct nvme_rdma_queue *queue)
 {
-	struct nvme_rdma_queue *queue;
+	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
+	int idx = nvme_rdma_queue_idx(queue);
 	struct sockaddr *src_addr = NULL;
 	int ret;
 
-	queue = &ctrl->queues[idx];
 	mutex_init(&queue->queue_lock);
-	queue->ctrl = ctrl;
 	if (idx && ctrl->ctrl.max_integrity_segments)
 		queue->pi_support = true;
 	else
@@ -587,8 +588,6 @@ static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
 	else
 		queue->cmnd_capsule_len = sizeof(struct nvme_command);
 
-	queue->queue_size = queue_size;
-
 	queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
 			RDMA_PS_TCP, IB_QPT_RC);
 	if (IS_ERR(queue->cm_id)) {
@@ -694,59 +693,59 @@ static int nvme_rdma_start_queue(struct nvme_rdma_ctrl *ctrl, int idx)
 	return ret;
 }
 
-static int nvme_rdma_start_io_queues(struct nvme_rdma_ctrl *ctrl,
-		int first, int last)
+static void nvme_rdma_setup_queue_work(struct work_struct *work)
 {
-	int i, ret = 0;
+	struct nvme_rdma_queue *queue =
+		container_of(work, struct nvme_rdma_queue, setup_work);
+	int ret;
 
-	for (i = first; i < last; i++) {
-		ret = nvme_rdma_start_queue(ctrl, i);
-		if (ret)
-			goto out_stop_queues;
-	}
+	ret = nvme_rdma_alloc_queue(queue);
+	if (ret)
+		goto out_err;
 
-	return 0;
+	ret = nvme_rdma_start_queue(queue->ctrl, queue->queue_idx);
+	if (ret)
+		goto out_err;
 
-out_stop_queues:
-	for (i--; i >= first; i--)
-		nvme_rdma_stop_queue(&ctrl->queues[i]);
-	return ret;
+	return;
+
+out_err:
+	atomic_cmpxchg(&queue->ctrl->qsetup_err, 0, ret);
 }
 
-static int nvme_rdma_alloc_io_queues(struct nvme_rdma_ctrl *ctrl)
+static int nvme_rdma_setup_io_queues(struct nvme_rdma_ctrl *ctrl, int first,
+		int last, size_t queue_size)
 {
-	struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
-	unsigned int nr_io_queues;
-	int i, ret;
+	int nr_queues = last - first;
+	int ret, i;
 
-	nr_io_queues = nvmf_nr_io_queues(opts);
-	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
-	if (ret)
-		return ret;
+	atomic_set(&ctrl->qsetup_err, 0);
 
-	if (nr_io_queues == 0) {
-		dev_err(ctrl->ctrl.device,
-			"unable to set any I/O queues\n");
-		return -ENOMEM;
-	}
-
-	ctrl->ctrl.queue_count = nr_io_queues + 1;
-	dev_info(ctrl->ctrl.device,
-		"creating %d I/O queues.\n", nr_io_queues);
+	for (i = 0; i < nr_queues; i++) {
+		struct nvme_rdma_queue *queue = &ctrl->queues[first + i];
 
-	nvmf_set_io_queues(opts, nr_io_queues, ctrl->io_queues);
-	for (i = 1; i < ctrl->ctrl.queue_count; i++) {
-		ret = nvme_rdma_alloc_queue(ctrl, i,
-				ctrl->ctrl.sqsize + 1);
-		if (ret)
-			goto out_free_queues;
+		queue->ctrl = ctrl;
+		queue->queue_idx = first + i;
+		queue->queue_size = queue_size;
+		INIT_WORK(&queue->setup_work, nvme_rdma_setup_queue_work);
+		queue_work(nvme_setup_wq, &queue->setup_work);
 	}
 
-	return 0;
+	for (i = 0; i < nr_queues; i++)
+		flush_work(&ctrl->queues[first + i].setup_work);
 
-out_free_queues:
-	for (i--; i >= 1; i--)
-		nvme_rdma_free_queue(&ctrl->queues[i]);
+	ret = atomic_read(&ctrl->qsetup_err);
+	if (ret) {
+		for (i = 0; i < nr_queues; i++) {
+			struct nvme_rdma_queue *queue =
+				&ctrl->queues[first + i];
+
+			if (test_bit(NVME_RDMA_Q_LIVE, &queue->flags))
+				nvme_rdma_stop_queue(queue);
+			if (test_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
+				nvme_rdma_free_queue(queue);
+		}
+	}
 
 	return ret;
 }
@@ -783,7 +782,9 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	bool pi_capable = false;
 	int error;
 
-	error = nvme_rdma_alloc_queue(ctrl, 0, NVME_AQ_DEPTH);
+	ctrl->queues[0].ctrl = ctrl;
+	ctrl->queues[0].queue_size = NVME_AQ_DEPTH;
+	error = nvme_rdma_alloc_queue(&ctrl->queues[0]);
 	if (error)
 		return error;
 
@@ -864,11 +865,22 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 {
 	int ret, nr_queues;
+	unsigned int nr_io_queues;
 
-	ret = nvme_rdma_alloc_io_queues(ctrl);
+	nr_io_queues = nvmf_nr_io_queues(ctrl->ctrl.opts);
+	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
 	if (ret)
 		return ret;
 
+	if (nr_io_queues == 0) {
+		dev_err(ctrl->ctrl.device, "unable to set any I/O queues\n");
+		return -ENOMEM;
+	}
+
+	ctrl->ctrl.queue_count = nr_io_queues + 1;
+	dev_info(ctrl->ctrl.device, "creating %d I/O queues.\n", nr_io_queues);
+	nvmf_set_io_queues(ctrl->ctrl.opts, nr_io_queues, ctrl->io_queues);
+
 	if (new) {
 		ret = nvme_rdma_alloc_tag_set(&ctrl->ctrl);
 		if (ret)
@@ -881,7 +893,9 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 	 * queue number might have changed.
 	 */
 	nr_queues = min(ctrl->tag_set.nr_hw_queues + 1, ctrl->ctrl.queue_count);
-	ret = nvme_rdma_start_io_queues(ctrl, 1, nr_queues);
+	ret = nvme_rdma_setup_io_queues(ctrl, 1, nr_queues,
+			ctrl->ctrl.sqsize + 1);
+
 	if (ret)
 		goto out_cleanup_tagset;
 
@@ -905,12 +919,15 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 
 	/*
 	 * If the number of queues has increased (reconnect case)
-	 * start all new queues now.
+	 * setup all new queues now.
 	 */
-	ret = nvme_rdma_start_io_queues(ctrl, nr_queues,
-			ctrl->tag_set.nr_hw_queues + 1);
-	if (ret)
-		goto out_wait_freeze_timed_out;
+	if (ctrl->tag_set.nr_hw_queues + 1 > nr_queues) {
+		ret = nvme_rdma_setup_io_queues(ctrl, nr_queues,
+				ctrl->tag_set.nr_hw_queues + 1,
+				ctrl->ctrl.sqsize + 1);
+		if (ret)
+			goto out_wait_freeze_timed_out;
+	}
 
 	return 0;
 
-- 
2.54.0