From: Mohamed Khalfella
To: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni
Cc: Casey Chen, Vikas Manocha, Yuanyuan Zhong, Hannes Reinecke, Ming Lei,
    linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, Mohamed Khalfella
Subject: [PATCH v2 1/1] nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
Date: Mon, 17 Nov 2025 12:23:53 -0800
Message-ID: <20251117202414.4071380-2-mkhalfella@purestorage.com>
In-Reply-To: <20251117202414.4071380-1-mkhalfella@purestorage.com>
References: <20251117202414.4071380-1-mkhalfella@purestorage.com>

blk_mq_{add,del}_queue_tag_set() add and remove queues from a tag set.
These functions make sure the tag set and its queues are marked as
shared once two or more queues are attached to the same tag set. A tag
set starts out unshared; when the number of added queues reaches two,
blk_mq_add_queue_tag_set() marks it as shared along with all the
queues attached to it. When the number of attached queues drops back
to one, blk_mq_del_queue_tag_set() needs to mark both the tag set and
the remaining queue as unshared. Both functions must freeze the
current queues in the tag set before setting or clearing
BLK_MQ_F_TAG_QUEUE_SHARED. While doing so, both functions hold the
set->tag_list_lock mutex, which makes sense: we do not want queues to
be added or deleted in the process.

This used to work fine until commit 98d81f0df70c ("nvme: use
blk_mq_[un]quiesce_tagset") made the nvme driver quiesce the tag set
instead of quiescing individual queues. blk_mq_quiesce_tagset()
quiesces the queues in set->tag_list while also holding
set->tag_list_lock. This results in a deadlock between two threads
with these stacktraces:

__schedule+0x48e/0xed0
schedule+0x5a/0xc0
schedule_preempt_disabled+0x11/0x20
__mutex_lock.constprop.0+0x3cc/0x760
blk_mq_quiesce_tagset+0x26/0xd0
nvme_dev_disable_locked+0x77/0x280 [nvme]
nvme_timeout+0x268/0x320 [nvme]
blk_mq_handle_expired+0x5d/0x90
bt_iter+0x7e/0x90
blk_mq_queue_tag_busy_iter+0x2b2/0x590
? __blk_mq_complete_request_remote+0x10/0x10
? __blk_mq_complete_request_remote+0x10/0x10
blk_mq_timeout_work+0x15b/0x1a0
process_one_work+0x133/0x2f0
? mod_delayed_work_on+0x90/0x90
worker_thread+0x2ec/0x400
? mod_delayed_work_on+0x90/0x90
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2d/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
__schedule+0x48e/0xed0
schedule+0x5a/0xc0
blk_mq_freeze_queue_wait+0x62/0x90
? destroy_sched_domains_rcu+0x30/0x30
blk_mq_exit_queue+0x151/0x180
disk_release+0xe3/0xf0
device_release+0x31/0x90
kobject_put+0x6d/0x180
nvme_scan_ns+0x858/0xc90 [nvme_core]
? nvme_scan_work+0x281/0x560 [nvme_core]
nvme_scan_work+0x281/0x560 [nvme_core]
process_one_work+0x133/0x2f0
? mod_delayed_work_on+0x90/0x90
worker_thread+0x2ec/0x400
? mod_delayed_work_on+0x90/0x90
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2d/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20

The top stacktrace shows nvme_timeout() being called to handle an nvme
command timeout. The timeout handler is trying to disable the
controller, and as a first step it calls blk_mq_quiesce_tagset() to
tell blk-mq not to invoke queue callback handlers. The thread is stuck
waiting for set->tag_list_lock as it tries to walk the queues in
set->tag_list. The lock is held by the second thread, in the bottom
stack, which is waiting for one of the queues to be frozen. The queue
usage counter would drop to zero only after nvme_timeout() finishes,
and that will never happen because the timeout thread waits for the
mutex forever.

Convert the set->tag_list_lock mutex to the set->tag_list_rwsem
rw-semaphore to avoid the deadlock. Update blk_mq_[un]quiesce_tagset()
to take the semaphore for read, since that is enough to guarantee no
queues are added or removed. Update blk_mq_{add,del}_queue_tag_set()
to take the semaphore for write while updating set->tag_list, and
downgrade it to read while freezing the queues. It should be safe to
update set->flags and hctx->flags while holding the semaphore for read
since the queues are already frozen.
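For clarity, the core of the new scheme is the write-then-downgrade
pattern. Below is a minimal, self-contained sketch of that pattern in
kernel-style C; demo_set, DEMO_F_SHARED, and demo_del_queue are
hypothetical stand-ins for struct blk_mq_tag_set,
BLK_MQ_F_TAG_QUEUE_SHARED, and blk_mq_del_queue_tag_set; only the
rwsem and list primitives are the real kernel API:

#include <linux/rwsem.h>
#include <linux/list.h>

#define DEMO_F_SHARED	(1U << 0)  /* stand-in for BLK_MQ_F_TAG_QUEUE_SHARED */

struct demo_set {
	struct rw_semaphore	tag_list_rwsem;
	struct list_head	tag_list;
	unsigned int		flags;
};

static void demo_del_queue(struct demo_set *set, struct list_head *entry)
{
	down_write(&set->tag_list_rwsem);
	list_del_init(entry);
	if (!list_is_singular(&set->tag_list)) {
		up_write(&set->tag_list_rwsem);
		return;
	}
	/* The flag update is still fully serialized by the write lock. */
	set->flags &= ~DEMO_F_SHARED;
	/*
	 * Let readers in before the potentially unbounded freeze wait.
	 * A concurrent quiesce taking the semaphore for read can now
	 * walk tag_list, while the list itself stays stable until the
	 * up_read() below.
	 */
	downgrade_write(&set->tag_list_rwsem);
	/* ... freeze the remaining queue and clear its shared flag ... */
	up_read(&set->tag_list_rwsem);
}

The key property is that downgrade_write() converts the writer into a
reader without ever releasing the semaphore, so no other writer can
slip in between the flag update and the freeze, while readers such as
blk_mq_quiesce_tagset() are no longer blocked for the duration of the
freeze.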
Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
Signed-off-by: Mohamed Khalfella
Reviewed-by: Ming Lei
Reviewed-by: Sagi Grimberg
---
 block/blk-mq-sysfs.c   | 10 ++---
 block/blk-mq.c         | 95 +++++++++++++++++++++++-------------------
 include/linux/blk-mq.h |  4 +-
 3 files changed, 58 insertions(+), 51 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 58ec293373c6..f474781654fb 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -230,13 +230,13 @@ int blk_mq_sysfs_register(struct gendisk *disk)
 
 	kobject_uevent(q->mq_kobj, KOBJ_ADD);
 
-	mutex_lock(&q->tag_set->tag_list_lock);
+	down_read(&q->tag_set->tag_list_rwsem);
 	queue_for_each_hw_ctx(q, hctx, i) {
 		ret = blk_mq_register_hctx(hctx);
 		if (ret)
 			goto out_unreg;
 	}
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);
 	return 0;
 
 out_unreg:
@@ -244,7 +244,7 @@ int blk_mq_sysfs_register(struct gendisk *disk)
 		if (j < i)
 			blk_mq_unregister_hctx(hctx);
 	}
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);
 
 	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
 	kobject_del(q->mq_kobj);
@@ -257,10 +257,10 @@ void blk_mq_sysfs_unregister(struct gendisk *disk)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	mutex_lock(&q->tag_set->tag_list_lock);
+	down_read(&q->tag_set->tag_list_rwsem);
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
-	mutex_unlock(&q->tag_set->tag_list_lock);
+	up_read(&q->tag_set->tag_list_rwsem);
 
 	kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
 	kobject_del(q->mq_kobj);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d626d32f6e57..9211d32ce820 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -335,12 +335,12 @@ void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;
 
-	mutex_lock(&set->tag_list_lock);
+	down_read(&set->tag_list_rwsem);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_quiesce_queue_nowait(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	up_read(&set->tag_list_rwsem);
 
 	blk_mq_wait_quiesce_done(set);
 }
@@ -350,12 +350,12 @@ void blk_mq_unquiesce_tagset(struct blk_mq_tag_set *set)
 {
 	struct request_queue *q;
 
-	mutex_lock(&set->tag_list_lock);
+	down_read(&set->tag_list_rwsem);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		if (!blk_queue_skip_tagset_quiesce(q))
 			blk_mq_unquiesce_queue(q);
 	}
-	mutex_unlock(&set->tag_list_lock);
+	up_read(&set->tag_list_rwsem);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_tagset);
 
@@ -4274,56 +4274,63 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
-					 bool shared)
-{
-	struct request_queue *q;
-	unsigned int memflags;
-
-	lockdep_assert_held(&set->tag_list_lock);
-
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		memflags = blk_mq_freeze_queue(q);
-		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q, memflags);
-	}
-}
-
 static void blk_mq_del_queue_tag_set(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *firstq;
+	unsigned int memflags;
 
-	mutex_lock(&set->tag_list_lock);
+	down_write(&set->tag_list_rwsem);
 	list_del(&q->tag_set_list);
-	if (list_is_singular(&set->tag_list)) {
-		/* just transitioned to unshared */
-		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
-		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, false);
+	if (!list_is_singular(&set->tag_list)) {
+		up_write(&set->tag_list_rwsem);
+		goto out;
 	}
-	mutex_unlock(&set->tag_list_lock);
+
+	/*
+	 * Transitioning the remaining firstq to unshared.
+	 * Also, downgrade the semaphore to avoid deadlock
+	 * with blk_mq_quiesce_tagset() while waiting for
+	 * firstq to be frozen.
+	 */
+	set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
+	downgrade_write(&set->tag_list_rwsem);
+	firstq = list_first_entry(&set->tag_list, struct request_queue,
+				  tag_set_list);
+	memflags = blk_mq_freeze_queue(firstq);
+	queue_set_hctx_shared(firstq, false);
+	blk_mq_unfreeze_queue(firstq, memflags);
+	up_read(&set->tag_list_rwsem);
+out:
 	INIT_LIST_HEAD(&q->tag_set_list);
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 				     struct request_queue *q)
 {
-	mutex_lock(&set->tag_list_lock);
+	struct request_queue *firstq;
+	unsigned int memflags;
 
-	/*
-	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
-	 */
-	if (!list_empty(&set->tag_list) &&
-	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
-		/* update existing queue */
-		blk_mq_update_tag_set_shared(set, true);
-	}
-	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		queue_set_hctx_shared(q, true);
-	list_add_tail(&q->tag_set_list, &set->tag_list);
+	down_write(&set->tag_list_rwsem);
+	if (!list_is_singular(&set->tag_list)) {
+		if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
+			queue_set_hctx_shared(q, true);
+		list_add_tail(&q->tag_set_list, &set->tag_list);
+		up_write(&set->tag_list_rwsem);
+		return;
+	}
 
-	mutex_unlock(&set->tag_list_lock);
+	/* Transitioning firstq and q to shared. */
+	set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
+	list_add_tail(&q->tag_set_list, &set->tag_list);
+	downgrade_write(&set->tag_list_rwsem);
+	queue_set_hctx_shared(q, true);
+	firstq = list_first_entry(&set->tag_list, struct request_queue,
+				  tag_set_list);
+	memflags = blk_mq_freeze_queue(firstq);
+	queue_set_hctx_shared(firstq, true);
+	blk_mq_unfreeze_queue(firstq, memflags);
+	up_read(&set->tag_list_rwsem);
 }
 
 /* All allocations will be freed in release handler of q->mq_kobj */
@@ -4855,7 +4862,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;
 
-	mutex_init(&set->tag_list_lock);
+	init_rwsem(&set->tag_list_rwsem);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
@@ -5044,7 +5051,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	struct xarray elv_tbl, et_tbl;
 	bool queues_frozen = false;
 
-	lockdep_assert_held(&set->tag_list_lock);
+	lockdep_assert_held_write(&set->tag_list_rwsem);
 
 	if (set->nr_maps == 1 && nr_hw_queues > nr_cpu_ids)
 		nr_hw_queues = nr_cpu_ids;
@@ -5129,9 +5136,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	down_write(&set->update_nr_hwq_lock);
-	mutex_lock(&set->tag_list_lock);
+	down_write(&set->tag_list_rwsem);
 	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
-	mutex_unlock(&set->tag_list_lock);
+	up_write(&set->tag_list_rwsem);
 	up_write(&set->update_nr_hwq_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46..4c8441671518 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -502,7 +502,7 @@ enum hctx_type {
  * @shared_tags:
  *		   Shared set of tags. Has @nr_hw_queues elements. If set,
  *		   shared by all @tags.
- * @tag_list_lock: Serializes tag_list accesses.
+ * @tag_list_rwsem: Serializes tag_list accesses.
 * @tag_list:	   List of the request queues that use this tag set. See also
 *		   request_queue.tag_set_list.
 * @srcu:	   Use as lock when type of the request queue is blocking
@@ -530,7 +530,7 @@ struct blk_mq_tag_set {
 
 	struct blk_mq_tags	*shared_tags;
 
-	struct mutex		tag_list_lock;
+	struct rw_semaphore	tag_list_rwsem;
 	struct list_head	tag_list;
 	struct srcu_struct	*srcu;
 	struct srcu_struct	tags_srcu;
-- 
2.51.0