From nobody Fri Dec 19 19:09:35 2025 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 79C76181B96 for ; Tue, 14 May 2024 17:53:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709226; cv=none; b=elpzmMyF8XI6q43onSv3sPFhpTnfOLQ3phg4r+CGxP8y4wCSzEkhTlDQmFdqndK27+0MFG9yD6nlQefyEoB8AcI96VzfwP+d/sIVqc9fYN2WjGqnq2VWYlmP+jt3wgAjEkU+4qNenJTBWbXL3Hn2aFz4aPGS59yFgUW6L3l1zOo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709226; c=relaxed/simple; bh=o5KUGg2KS1AmCQ0prYwGDR7DSjizqXsJpnTKNoT/JQk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-type; b=JR2Ei87ast0iwCz8L8XYtj0W1dBSgWPMhLQuMOKOGsdKVPizkwHGpc8bBAR2CW4gNDcNBXYvROGZRlWpzzfWOfVGckjpS8HVXC9Ije9ulSa2iGOcQ1RYs1c0LngI6fBk+HNKkHaoAJz619KngC4am0RKef+SYy75FGhXuS3V+Es= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=AuTg8Jcn; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="AuTg8Jcn" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715709223; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: 
content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=spH1JDbuK4ZYnUhfY5CtXEY/+ATQ9QBw/vzumT0zZiI=; b=AuTg8JcntBDfBpMc6naIYWpyguVZ0fhSv/BG3ON/ocEMmsuI4ksOjZxG5RKMzYMFhGA8zY wS9DNKHNF09d5dn8IKiFx8CWptfEtpsh9iUX5hNPti59Iw8t6bj/KFw+qA7O15//eAPhb+ 3C9QHTRY/MxUNvUtRqlg/9kj6E3PuH8= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-587-vQ3kFgJwPRa0E5_slYnDiQ-1; Tue, 14 May 2024 13:53:30 -0400 X-MC-Unique: vQ3kFgJwPRa0E5_slYnDiQ-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 9514C101A54F; Tue, 14 May 2024 17:53:29 +0000 (UTC) Received: from jmeneghi.bos.com (unknown [10.2.17.24]) by smtp.corp.redhat.com (Postfix) with ESMTP id 79D3E4B400E; Tue, 14 May 2024 17:53:28 +0000 (UTC) From: John Meneghini To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, emilne@redhat.com, hare@kernel.org Cc: linux-block@vger.kernel.org, cgroups@vger.kernel.org, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, jmeneghi@redhat.com, jrani@purestorage.com, randyj@purestorage.com Subject: [PATCH v4 1/6] nvme: multipath: Implemented new iopolicy "queue-depth" Date: Tue, 14 May 2024 13:53:17 -0400 Message-Id: <20240514175322.19073-2-jmeneghi@redhat.com> In-Reply-To: <20240514175322.19073-1-jmeneghi@redhat.com> References: <20240514175322.19073-1-jmeneghi@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.9 
Content-Type: text/plain; charset="utf-8"

From: "Ewan D. Milne"

The existing iopolicies are inefficient in some cases, such as the
presence of a path with high latency. The round-robin policy would
use that path equally with faster paths, which results in sub-optimal
performance.

The queue-depth policy instead sends I/O requests down the path
with the fewest requests in its request queue. Paths with lower
latency will clear requests more quickly and have fewer requests in
their queues compared to "bad" paths. The aim is to use those paths
the most to bring down overall latency.

This implementation adds an atomic variable to the nvme_ctrl
struct to represent the queue depth. It is updated each time a
request specific to that controller starts or ends.

[edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
      and compilation warnings, updated MODULE_PARM description, and
      fixed potential issue with ->current_path[] being used]

Tested-by: John Meneghini
Co-developed-by: Thomas Song
Signed-off-by: Thomas Song
Signed-off-by: Ewan D. Milne
---
 drivers/nvme/host/multipath.c | 59 +++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h      |  2 ++
 2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5397fb428b24..9e36002d0831 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
 static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA] = "numa",
 	[NVME_IOPOLICY_RR] = "round-robin",
+	[NVME_IOPOLICY_QD] = "queue-depth",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_NUMA;
 	else if (!strncmp(val, "round-robin", 11))
 		iopolicy = NVME_IOPOLICY_RR;
+	else if (!strncmp(val, "queue-depth", 11))
+		iopolicy = NVME_IOPOLICY_QD;
 	else
 		return -EINVAL;
 
@@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
 	&iopolicy, 0644);
 MODULE_PARM_DESC(iopolicy,
-	"Default multipath I/O policy; 'numa' (default) or 'round-robin'");
+	"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
 
 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
@@ -130,6 +133,7 @@ void nvme_mpath_start_request(struct request *rq)
 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
 		return;
 
+	atomic_inc(&ns->ctrl->nr_active);
 	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
 	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
 						      jiffies);
@@ -142,6 +146,8 @@ void nvme_mpath_end_request(struct request *rq)
 
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
+
+	atomic_dec(&ns->ctrl->nr_active);
 	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
 			 blk_rq_bytes(rq) >> SECTOR_SHIFT,
 			 nvme_req(rq)->start_time);
@@ -330,6 +336,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 	return found;
 }
 
+static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+	unsigned int depth;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		depth = atomic_read(&ns->ctrl->nr_active);
+
+		switch (ns->ana_state) {
+		case NVME_ANA_OPTIMIZED:
+			if (depth < min_depth_opt) {
+				min_depth_opt = depth;
+				best_opt = ns;
+			}
+			break;
+
+		case NVME_ANA_NONOPTIMIZED:
+			if (depth < min_depth_nonopt) {
+				min_depth_nonopt = depth;
+				best_nonopt = ns;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+
+	return best_opt ? best_opt : best_nonopt;
+}
+
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 {
 	return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
@@ -338,15 +378,27 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 
 inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 {
-	int node = numa_node_id();
+	int iopolicy = READ_ONCE(head->subsys->iopolicy);
+	int node;
 	struct nvme_ns *ns;
 
+	/*
+	 * queue-depth iopolicy does not need to reference ->current_path
+	 * but round-robin needs the last path used to advance to the
+	 * next one, and numa will continue to use the last path unless
+	 * it is or has become not optimized
+	 */
+	if (iopolicy == NVME_IOPOLICY_QD)
+		return nvme_queue_depth_path(head);
+
+	node = numa_node_id();
 	ns = srcu_dereference(head->current_path[node], &head->srcu);
 	if (unlikely(!ns))
 		return __nvme_find_path(head, node);
 
-	if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
+	if (iopolicy == NVME_IOPOLICY_RR)
 		return nvme_round_robin_path(head, node, ns);
+
 	if (unlikely(!nvme_path_is_optimized(ns)))
 		return __nvme_find_path(head, node);
 	return ns;
@@ -905,6 +957,7 @@ void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl)
 	mutex_init(&ctrl->ana_lock);
 	timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
 	INIT_WORK(&ctrl->ana_work, nvme_ana_work);
+	atomic_set(&ctrl->nr_active, 0);
 }
 
 int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f243a5822c2b..e7d0a56d35d4 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -354,6 +354,7 @@ struct nvme_ctrl {
 	size_t ana_log_size;
 	struct timer_list anatt_timer;
 	struct work_struct ana_work;
+	atomic_t nr_active;
 #endif
 
 #ifdef CONFIG_NVME_HOST_AUTH
@@ -402,6 +403,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
 enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
+	NVME_IOPOLICY_QD,
 };
 
 struct nvme_subsystem {
-- 
2.39.3

From nobody Fri Dec 19 19:09:35 2025
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AFD211802B1 for ; Tue, 14 May 2024 17:53:35 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709217; cv=none; b=nF2XR9PbfdqxmbrutZzlom2m0kbqwyE09UK0+jHv6pAO8PtNPU28KM9FcySyusBsjK6V4UWUhUHlAsbzEnuscQ1UUeVBjVqIZ6iv+AXnFK+nbdnHFnRBXtApOHNL+keXnsLfiLetqMzL9Jc3sVMAeqJ6YqfPiJp327Qh0W/xGMs=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709217; c=relaxed/simple; bh=EiZsyI8d5/iT9mBJabkbwAVJGQIFKsERJkeDpSU8nCk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-type; b=il/PCN9kHxuRAv+UowD50Akrck1Xk3udChTl43qwppBgLXMPBEZMw1PVLr7eGG/8pJvijPSW5z8W3wkQ/SBXvqLNB/xzeRYhZY0IN8QL7pXr0oLYkOfuD6AcCc5bpnp1A0UUfCRZDGKjGKaF+uTNccCQkklxqQ6SMqoO+9ahTb0=
ARC-Authentication-Results: i=1;
smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=TIfEsGPn; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="TIfEsGPn" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715709214; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6h1FkNmsvcraJuPKXwoMATWjG6B4CiSCjyJHEuznVDE=; b=TIfEsGPn0rwH1ZszSoLiTEzx5RKB+5Ab7ZlVVRvmb/9Gh2mJ01W+kgYyBQb1diXU8j5liw kHKHZ8Et/79yEqZIQRyLW/urtVNTAJmIcm1zDzbGhsflBSDY/XtOyPdMWj+BmsyRpDGiNY RxTGdWXnN/81A04ardesU2LxD4PXiy0= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-240-qbfSjT5DPbmEEoR6umgCjw-1; Tue, 14 May 2024 13:53:31 -0400 X-MC-Unique: qbfSjT5DPbmEEoR6umgCjw-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id CBAA638000A6; Tue, 14 May 2024 17:53:30 +0000 (UTC) Received: from jmeneghi.bos.com (unknown [10.2.17.24]) by smtp.corp.redhat.com (Postfix) with ESMTP id B1595400EAC; Tue, 14 May 2024 17:53:29 +0000 (UTC) From: John Meneghini To: tj@kernel.org, josef@toxicpanda.com, 
 axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, emilne@redhat.com, hare@kernel.org
Cc: linux-block@vger.kernel.org, cgroups@vger.kernel.org, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, jmeneghi@redhat.com, jrani@purestorage.com, randyj@purestorage.com
Subject: [PATCH v4 2/6] nvme: multipath: only update ctrl->nr_active when using queue-depth iopolicy
Date: Tue, 14 May 2024 13:53:18 -0400
Message-Id: <20240514175322.19073-3-jmeneghi@redhat.com>
In-Reply-To: <20240514175322.19073-1-jmeneghi@redhat.com>
References: <20240514175322.19073-1-jmeneghi@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.9
Content-Type: text/plain; charset="utf-8"

From: "Ewan D. Milne"

The atomic updates of ctrl->nr_active are unnecessary when using
numa or round-robin iopolicy, so avoid that cost on a per-request basis.
Clear nr_active when changing iopolicy and do not decrement below zero.
(This handles changing the iopolicy while requests are in flight.)

Tested-by: John Meneghini
Signed-off-by: Ewan D. Milne
---
 drivers/nvme/host/core.c      |  2 +-
 drivers/nvme/host/multipath.c | 21 ++++++++++++++++++---
 drivers/nvme/host/nvme.h      |  6 ++++++
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a066429b790d..1dd7c52293ff 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
 EXPORT_SYMBOL_GPL(nvme_delete_wq);
 
 static LIST_HEAD(nvme_subsystems);
-static DEFINE_MUTEX(nvme_subsystems_lock);
+DEFINE_MUTEX(nvme_subsystems_lock);
 
 static DEFINE_IDA(nvme_instance_ida);
 static dev_t nvme_ctrl_base_chr_devt;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 9e36002d0831..1e9338543ded 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -133,7 +133,8 @@ void nvme_mpath_start_request(struct request *rq)
 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
 		return;
 
-	atomic_inc(&ns->ctrl->nr_active);
+	if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD)
+		atomic_inc(&ns->ctrl->nr_active);
 	nvme_req(rq)->flags |= NVME_MPATH_IO_STATS;
 	nvme_req(rq)->start_time = bdev_start_io_acct(disk->part0, req_op(rq),
 						      jiffies);
@@ -147,7 +148,8 @@ void nvme_mpath_end_request(struct request *rq)
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
 
-	atomic_dec(&ns->ctrl->nr_active);
+	if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD)
+		atomic_dec_if_positive(&ns->ctrl->nr_active);
 	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
 			 blk_rq_bytes(rq) >> SECTOR_SHIFT,
 			 nvme_req(rq)->start_time);
@@ -850,6 +852,19 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
 			  nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
 }
 
+void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
+{
+	struct nvme_ctrl *ctrl;
+
+	WRITE_ONCE(subsys->iopolicy, iopolicy);
+
+	mutex_lock(&nvme_subsystems_lock);
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		atomic_set(&ctrl->nr_active, 0);
+	}
+	mutex_unlock(&nvme_subsystems_lock);
+}
+
 static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t count)
 {
@@ -859,7 +874,7 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
 
 	for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
 		if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
-			WRITE_ONCE(subsys->iopolicy, i);
+			nvme_subsys_iopolicy_update(subsys, i);
 			return count;
 		}
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7d0a56d35d4..4e876524726a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
 extern struct workqueue_struct *nvme_reset_wq;
 extern struct workqueue_struct *nvme_delete_wq;
 
+extern struct mutex nvme_subsystems_lock;
+
 /*
  * List of workarounds for devices that required behavior not specified in
  * the standard.
@@ -937,6 +939,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
 void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
 void nvme_mpath_start_request(struct request *rq);
 void nvme_mpath_end_request(struct request *rq);
+void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);
 
 static inline void nvme_trace_bio_complete(struct request *req)
 {
@@ -1036,6 +1039,9 @@ static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
 {
 	return false;
 }
+static inline void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
+{
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
-- 
2.39.3

From nobody Fri Dec 19 19:09:35 2025
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0382B180A80 for ; Tue, 14 May 2024 17:53:38 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709220; cv=none; b=dIxeuAz4W61HGs+SM7dWmOyZ9aIqZnZorjYysIqJvwLkrGJywRubrekwSMVrLUx4+UE0i2Sf74dQhlpqQkKV0FCmIsoZL9EW3c6H43fx6xP/6rs2zybcLvOxhwBz/tl5sUZNUP2WRN/BaWmENTvkEN0f35p0RGgGPR5FeLZikvo=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709220; c=relaxed/simple; bh=b6e8Dnh3maUD9lwtp1+zI5BEIk5U7b2o4S0cX74JYYs=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-type; b=Kz/fpe3P0OOKT1SHr5bl2iEs0Pu+PDhk8KrT+57FNfgUySiGu+WUOdZxxAOnWZ2NnlnNC0dF2cqOHaa6eHUJhpaQ7C2cTHqrSoDVpaXL/w2iCkfc3+NQR2lZsJa6WMLM9QLMxyZU6DjIacsaDSPcLKS+L7aZVyPYWEbgWGqTi44=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass
(1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=h4Uz/GOd; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="h4Uz/GOd" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715709218; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=vFxooJ/Wd2kSlfPg0r+rfS28FX2xdaR0dGl5yZTVUZc=; b=h4Uz/GOdvHsKHGwX5hFhNrTyJ0i+YKiTX5vdWUGK3ji/0cCfulbWS5giGzvvL4KxmYL/Z0 2L3ku8HOl7/MDRNKOljy038RwdBQfuSU71VNUzKLa8hOQrQK5KppZaMLgFVx2GeDsm7/OC 1RZWDzpcMVFTrNWYBq3Matq+Ngluulw= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-222-cAwzDMOuML6LGk8vAbX-0g-1; Tue, 14 May 2024 13:53:32 -0400 X-MC-Unique: cAwzDMOuML6LGk8vAbX-0g-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 118E08001F7; Tue, 14 May 2024 17:53:32 +0000 (UTC) Received: from jmeneghi.bos.com (unknown [10.2.17.24]) by smtp.corp.redhat.com (Postfix) with ESMTP id E87CC400059; Tue, 14 May 2024 17:53:30 +0000 (UTC) From: John Meneghini To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, emilne@redhat.com, hare@kernel.org Cc: 
 linux-block@vger.kernel.org, cgroups@vger.kernel.org, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, jmeneghi@redhat.com, jrani@purestorage.com, randyj@purestorage.com
Subject: [PATCH v4 3/6] nvme: multipath: Invalidate current_path when changing iopolicy
Date: Tue, 14 May 2024 13:53:19 -0400
Message-Id: <20240514175322.19073-4-jmeneghi@redhat.com>
In-Reply-To: <20240514175322.19073-1-jmeneghi@redhat.com>
References: <20240514175322.19073-1-jmeneghi@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.9
Content-Type: text/plain; charset="utf-8"

From: "Ewan D. Milne"

When switching back to numa from round-robin, current_path may refer to a
different path than the one numa would have selected, and it is desirable to
have consistent behavior.

Tested-by: John Meneghini
Signed-off-by: Ewan D. Milne
Reviewed-by: Christoph Hellwig
---
 drivers/nvme/host/multipath.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1e9338543ded..8702a40a1971 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -861,6 +861,7 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
 	mutex_lock(&nvme_subsystems_lock);
 	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
 		atomic_set(&ctrl->nr_active, 0);
+		nvme_mpath_clear_ctrl_paths(ctrl);
 	}
 	mutex_unlock(&nvme_subsystems_lock);
 }
-- 
2.39.3

From nobody Fri Dec 19 19:09:35 2025
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2A050180A94 for ; Tue, 14 May 2024 17:53:40 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709222; cv=none; b=trN+K3QfVna5ME6yMv+kKpiqYbxkb2qrqTIKi0X1OrGJvrBhaulhM5rE/hNEW0RTTfRJZa/OWsMnaYzUyvEvJKq7a+hcsRxbfJBZeX+PERY1DtYAQMnuemHbYQ140D4146UFrP18+m7d3htb6oHPX8tp1Xhh07hl1/Kje2jeIlI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709222; c=relaxed/simple; bh=Q7hAj9aWN4FPMfURj721Lexj13X0WDjGnjxEI0jSZ8U=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-type; b=JW+18qfi8ZDcnYrtnPf3jo4Vl0Cx9iqCVPLblyHgR05x3H9gTn293QlFQ1wRgav7lqoJQTMfWDv9PVj3vxtQKe1HpnqirDB1zvPbrxcuNtYViBzBTAWwDzihEoUnlQzZxkV3+8M8Pjfuxa1sy7brcl2CPyWq1KOtirFGk8bG63w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=UcNwWfqx; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="UcNwWfqx" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715709219; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=kumEGQnJWJG7jTUKOQ63w/Usm47ya7G4ljqiH1pkRFg=; b=UcNwWfqxzLsF48IGyUEbDUNmxrAYNC0YtDS2DhudwB9vLTgIUERP1tvl7voWJY/o9486m0 T7d3tqoqYbl9vAO+zOQPfTNdOrQz2YxFQuMjrxP5OI0HBczn5qcpvkGJirY4QaUeYL8qWF VIyG/g3tg9l59HWWvuX5Z8ae38jzmCE= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with 
 STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-611-LtXqz-vqM1W_RfyGOHnoTA-1; Tue, 14 May 2024 13:53:34 -0400
X-MC-Unique: LtXqz-vqM1W_RfyGOHnoTA-1
Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 4983B38000A3; Tue, 14 May 2024 17:53:33 +0000 (UTC)
Received: from jmeneghi.bos.com (unknown [10.2.17.24]) by smtp.corp.redhat.com (Postfix) with ESMTP id 2D843400EAF; Tue, 14 May 2024 17:53:32 +0000 (UTC)
From: John Meneghini
To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, emilne@redhat.com, hare@kernel.org
Cc: linux-block@vger.kernel.org, cgroups@vger.kernel.org, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, jmeneghi@redhat.com, jrani@purestorage.com, randyj@purestorage.com
Subject: [PATCH v4 4/6] block: track per-node I/O latency
Date: Tue, 14 May 2024 13:53:20 -0400
Message-Id: <20240514175322.19073-5-jmeneghi@redhat.com>
In-Reply-To: <20240514175322.19073-1-jmeneghi@redhat.com>
References: <20240514175322.19073-1-jmeneghi@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.9
Content-Type: text/plain; charset="utf-8"

From: Hannes Reinecke

Add a new option 'BLK_NODE_LATENCY' to track per-node I/O latency.
This can be used by I/O schedulers to determine the 'best' queue
to send I/O to.

Signed-off-by: Hannes Reinecke
[jmeneghi: cleaned up checkpatch warnings and updated MAINTAINERS]
Signed-off-by: John Meneghini
---
 MAINTAINERS            |   1 +
 block/Kconfig          |   9 +
 block/Makefile         |   1 +
 block/blk-mq-debugfs.c |   2 +
 block/blk-nlatency.c   | 389 +++++++++++++++++++++++++++++++++++++++++
 block/blk-rq-qos.h     |   6 +
 include/linux/blk-mq.h |  11 ++
 7 files changed, 419 insertions(+)
 create mode 100644 block/blk-nlatency.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7c121493f43d..a4634365c82f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5405,6 +5405,7 @@ F: block/bfq-cgroup.c
 F: block/blk-cgroup.c
 F: block/blk-iocost.c
 F: block/blk-iolatency.c
+F: block/blk-nlatency.c
 F: block/blk-throttle.c
 F: include/linux/blk-cgroup.h
 
diff --git a/block/Kconfig b/block/Kconfig
index d47398ae9824..d8edb4506769 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -185,6 +185,15 @@ config BLK_CGROUP_IOPRIO
 	scheduler and block devices process requests. Only some I/O schedulers
 	and some block devices support I/O priorities.
 
+config BLK_NODE_LATENCY
+	bool "Track per-node I/O latency"
+	help
+	  Enable per-node I/O latency tracking for multipathing. This uses the
+	  blk-nodelat latency tracker to provide latencies for each node, and schedules
+	  I/O on the path with the least latency for the submitting node. This can be
+	  used by I/O schedulers to determine the node with the least latency. Currently
+	  only supports nvme over fabrics devices.
+
 config BLK_DEBUG_FS
 	bool "Block layer debugging information in debugfs"
 	default y
diff --git a/block/Makefile b/block/Makefile
index 168150b9c510..043d979de8fe 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
 obj-$(CONFIG_BLK_CGROUP_IOPRIO) += blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
+obj-$(CONFIG_BLK_NODE_LATENCY) += blk-nlatency.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
 bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 770c0c2b72fa..bc2541428e81 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -761,6 +761,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
 		return "latency";
 	case RQ_QOS_COST:
 		return "cost";
+	case RQ_QOS_NLAT:
+		return "node-latency";
 	}
 	return "unknown";
 }
diff --git a/block/blk-nlatency.c b/block/blk-nlatency.c
new file mode 100644
index 000000000000..219c3f636d76
--- /dev/null
+++ b/block/blk-nlatency.c
@@ -0,0 +1,389 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per-node request latency tracking.
+ *
+ * Copyright (C) 2023 Hannes Reinecke
+ *
+ * A simple per-node latency tracker for use by I/O scheduler.
+ * Latencies are measured over 'win_usec' microseconds and stored per node.
+ * If the number of measurements falls below 'lowat' the measurement is
+ * assumed to be unreliable and will become 'stale'.
+ * These 'stale' latencies can be 'decayed', where during each measurement
+ * interval the 'stale' latency value is decreased by 'decay' percent.
+ * Once the 'stale' latency reaches zero it will be updated by the
+ * measured latency.
+ */
+#include
+#include
+#include
+
+#include "blk-stat.h"
+#include "blk-rq-qos.h"
+#include "blk.h"
+
+#define NLAT_DEFAULT_LOWAT 2
+#define NLAT_DEFAULT_DECAY 50
+
+struct rq_nlat {
+	struct rq_qos rqos;
+
+	u64 win_usec; /* latency measurement window in microseconds */
+	unsigned int lowat; /* Low Watermark latency measurement */
+	unsigned int decay; /* Percentage for 'decaying' latencies */
+	bool enabled;
+
+	struct blk_stat_callback *cb;
+
+	unsigned int num;
+	u64 *latency;
+	unsigned int *samples;
+};
+
+static inline struct rq_nlat *RQNLAT(struct rq_qos *rqos)
+{
+	return container_of(rqos, struct rq_nlat, rqos);
+}
+
+static u64 nlat_default_latency_usec(struct request_queue *q)
+{
+	/*
+	 * We default to 2msec for non-rotational storage, and 75msec
+	 * for rotational storage.
+	 */
+	if (blk_queue_nonrot(q))
+		return 2000ULL;
+	else
+		return 75000ULL;
+}
+
+static void nlat_timer_fn(struct blk_stat_callback *cb)
+{
+	struct rq_nlat *nlat = cb->data;
+	int n;
+
+	for (n = 0; n < cb->buckets; n++) {
+		if (cb->stat[n].nr_samples < nlat->lowat) {
+			/*
+			 * 'decay' the latency by the specified
+			 * percentage to ensure the queues are
+			 * being tested to balance out temporary
+			 * latency spikes.
+ */ + nlat->latency[n] =3D + div64_u64(nlat->latency[n] * nlat->decay, 100); + } else + nlat->latency[n] =3D cb->stat[n].mean; + nlat->samples[n] =3D cb->stat[n].nr_samples; + } + if (nlat->enabled) + blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000); +} + +static int nlat_bucket_node(const struct request *rq) +{ + if (!rq->mq_ctx) + return -1; + return cpu_to_node(blk_mq_rq_cpu((struct request *)rq)); +} + +static void nlat_exit(struct rq_qos *rqos) +{ + struct rq_nlat *nlat =3D RQNLAT(rqos); + + blk_stat_remove_callback(nlat->rqos.disk->queue, nlat->cb); + blk_stat_free_callback(nlat->cb); + kfree(nlat->samples); + kfree(nlat->latency); + kfree(nlat); +} + +#ifdef CONFIG_BLK_DEBUG_FS +static int nlat_win_usec_show(void *data, struct seq_file *m) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + + seq_printf(m, "%llu\n", nlat->win_usec); + return 0; +} + +static ssize_t nlat_win_usec_write(void *data, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + char val[16] =3D { }; + u64 usec; + int err; + + if (blk_queue_dying(nlat->rqos.disk->queue)) + return -ENOENT; + + if (count >=3D sizeof(val)) + return -EINVAL; + + if (copy_from_user(val, buf, count)) + return -EFAULT; + + err =3D kstrtoull(val, 10, &usec); + if (err) + return err; + blk_stat_deactivate(nlat->cb); + nlat->win_usec =3D usec; + blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000); + + return count; +} + +static int nlat_lowat_show(void *data, struct seq_file *m) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + + seq_printf(m, "%u\n", nlat->lowat); + return 0; +} + +static ssize_t nlat_lowat_write(void *data, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + char val[16] =3D { }; + unsigned int lowat; + int err; + + if (blk_queue_dying(nlat->rqos.disk->queue)) + return 
-ENOENT; + + if (count >=3D sizeof(val)) + return -EINVAL; + + if (copy_from_user(val, buf, count)) + return -EFAULT; + + err =3D kstrtouint(val, 10, &lowat); + if (err) + return err; + blk_stat_deactivate(nlat->cb); + nlat->lowat =3D lowat; + blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000); + + return count; +} + +static int nlat_decay_show(void *data, struct seq_file *m) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + + seq_printf(m, "%u\n", nlat->decay); + return 0; +} + +static ssize_t nlat_decay_write(void *data, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + char val[16] =3D { }; + unsigned int decay; + int err; + + if (blk_queue_dying(nlat->rqos.disk->queue)) + return -ENOENT; + + if (count >=3D sizeof(val)) + return -EINVAL; + + if (copy_from_user(val, buf, count)) + return -EFAULT; + + err =3D kstrtouint(val, 10, &decay); + if (err) + return err; + if (decay > 100) + return -EINVAL; + blk_stat_deactivate(nlat->cb); + nlat->decay =3D decay; + blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000); + + return count; +} + +static int nlat_enabled_show(void *data, struct seq_file *m) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + + seq_printf(m, "%d\n", nlat->enabled); + return 0; +} + +static int nlat_id_show(void *data, struct seq_file *m) +{ + struct rq_qos *rqos =3D data; + + seq_printf(m, "%u\n", rqos->id); + return 0; +} + +static int nlat_latency_show(void *data, struct seq_file *m) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + int n; + + if (!nlat->enabled) + return 0; + + for (n =3D 0; n < nlat->num; n++) { + if (n > 0) + seq_puts(m, " "); + seq_printf(m, "%llu", nlat->latency[n]); + } + seq_puts(m, "\n"); + return 0; +} + +static int nlat_samples_show(void *data, struct seq_file *m) +{ + struct rq_qos *rqos =3D data; + struct rq_nlat *nlat =3D RQNLAT(rqos); + int 
n; + + if (!nlat->enabled) + return 0; + + for (n =3D 0; n < nlat->num; n++) { + if (n > 0) + seq_puts(m, " "); + seq_printf(m, "%u", nlat->samples[n]); + } + seq_puts(m, "\n"); + return 0; +} + +static const struct blk_mq_debugfs_attr nlat_debugfs_attrs[] =3D { + {"win_usec", 0600, nlat_win_usec_show, nlat_win_usec_write}, + {"lowat", 0600, nlat_lowat_show, nlat_lowat_write}, + {"decay", 0600, nlat_decay_show, nlat_decay_write}, + {"enabled", 0400, nlat_enabled_show}, + {"id", 0400, nlat_id_show}, + {"latency", 0400, nlat_latency_show}, + {"samples", 0400, nlat_samples_show}, + {}, +}; +#endif + +static const struct rq_qos_ops nlat_rqos_ops =3D { + .exit =3D nlat_exit, +#ifdef CONFIG_BLK_DEBUG_FS + .debugfs_attrs =3D nlat_debugfs_attrs, +#endif +}; + +u64 blk_nlat_latency(struct gendisk *disk, int node) +{ + struct rq_qos *rqos; + struct rq_nlat *nlat; + + rqos =3D nlat_rq_qos(disk->queue); + if (!rqos) + return 0; + nlat =3D RQNLAT(rqos); + if (node < 0 || node >=3D nlat->num) + return 0; + + return div64_u64(nlat->latency[node], 1000); +} +EXPORT_SYMBOL_GPL(blk_nlat_latency); + +int blk_nlat_enable(struct gendisk *disk) +{ + struct rq_qos *rqos; + struct rq_nlat *nlat; + + /* Latency tracking not enabled? */ + rqos =3D nlat_rq_qos(disk->queue); + if (!rqos) + return -EINVAL; + nlat =3D RQNLAT(rqos); + if (nlat->enabled) + return 0; + + /* Queue not registered? Maybe shutting down...
*/ + if (!blk_queue_registered(disk->queue)) + return -EAGAIN; + + nlat->enabled =3D true; + memset(nlat->latency, 0, sizeof(u64) * nlat->num); + memset(nlat->samples, 0, sizeof(unsigned int) * nlat->num); + blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000); + + return 0; +} +EXPORT_SYMBOL_GPL(blk_nlat_enable); + +void blk_nlat_disable(struct gendisk *disk) +{ + struct rq_qos *rqos =3D nlat_rq_qos(disk->queue); + struct rq_nlat *nlat; + + if (!rqos) + return; + nlat =3D RQNLAT(rqos); + if (nlat->enabled) { + blk_stat_deactivate(nlat->cb); + nlat->enabled =3D false; + } +} +EXPORT_SYMBOL_GPL(blk_nlat_disable); + +int blk_nlat_init(struct gendisk *disk) +{ + struct rq_nlat *nlat; + int ret =3D -ENOMEM; + + nlat =3D kzalloc(sizeof(*nlat), GFP_KERNEL); + if (!nlat) + return -ENOMEM; + + nlat->num =3D num_possible_nodes(); + nlat->lowat =3D NLAT_DEFAULT_LOWAT; + nlat->decay =3D NLAT_DEFAULT_DECAY; + nlat->win_usec =3D nlat_default_latency_usec(disk->queue); + + nlat->latency =3D kcalloc(nlat->num, sizeof(u64), GFP_KERNEL); + if (!nlat->latency) + goto err_free; + nlat->samples =3D kcalloc(nlat->num, sizeof(unsigned int), GFP_KERNEL); + if (!nlat->samples) + goto err_free; + nlat->cb =3D blk_stat_alloc_callback(nlat_timer_fn, nlat_bucket_node, + nlat->num, nlat); + if (!nlat->cb) + goto err_free; + + /* + * Register nlat as an rq_qos policy and add the stats callback.
+ */ + mutex_lock(&disk->queue->rq_qos_mutex); + ret =3D rq_qos_add(&nlat->rqos, disk, RQ_QOS_NLAT, &nlat_rqos_ops); + mutex_unlock(&disk->queue->rq_qos_mutex); + if (ret) + goto err_free_cb; + + blk_stat_add_callback(disk->queue, nlat->cb); + + return 0; + +err_free_cb: + blk_stat_free_callback(nlat->cb); +err_free: + kfree(nlat->samples); + kfree(nlat->latency); + kfree(nlat); + return ret; +} +EXPORT_SYMBOL_GPL(blk_nlat_init); diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h index 37245c97ee61..2fc11ced0c00 100644 --- a/block/blk-rq-qos.h +++ b/block/blk-rq-qos.h @@ -17,6 +17,7 @@ enum rq_qos_id { RQ_QOS_WBT, RQ_QOS_LATENCY, RQ_QOS_COST, + RQ_QOS_NLAT, }; =20 struct rq_wait { @@ -79,6 +80,11 @@ static inline struct rq_qos *iolat_rq_qos(struct request= _queue *q) return rq_qos_id(q, RQ_QOS_LATENCY); } =20 +static inline struct rq_qos *nlat_rq_qos(struct request_queue *q) +{ + return rq_qos_id(q, RQ_QOS_NLAT); +} + static inline void rq_wait_init(struct rq_wait *rq_wait) { atomic_set(&rq_wait->inflight, 0); diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 89ba6b16fe8b..925e8c19bedb 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -1150,4 +1150,15 @@ static inline int blk_rq_map_sg(struct request_queue= *q, struct request *rq, } void blk_dump_rq_flags(struct request *, char *); =20 +#ifdef CONFIG_BLK_NODE_LATENCY +int blk_nlat_enable(struct gendisk *disk); +void blk_nlat_disable(struct gendisk *disk); +u64 blk_nlat_latency(struct gendisk *disk, int node); +int blk_nlat_init(struct gendisk *disk); +#else +static inline int blk_nlat_enable(struct gendisk *disk) { return 0; } +static inline void blk_nlat_disable(struct gendisk *disk) {} +static inline u64 blk_nlat_latency(struct gendisk *disk, int node) { retur= n 0; } +static inline int blk_nlat_init(struct gendisk *disk) { return -EOPNOTSUPP= ; } +#endif #endif /* BLK_MQ_H */ --=20 2.39.3 From nobody Fri Dec 19 19:09:35 2025 Received: from us-smtp-delivery-124.mimecast.com 
(us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D6F2B181316 for ; Tue, 14 May 2024 17:53:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709223; cv=none; b=O2D5D7cQalnKl1F1JGqcf1TqgaRbCp6h052kejy5YHNhnCV1rNojhnevIxTr2wB73H/+y5l0+UMeTn1t6jpVJeLusP/XKApcn1DcyOGafxjzyL+dS7dq4MXWGH5ACIWHkqnu62jLmu/fG8QbrVUbf0OF8kmnPqJzV58dKs2xmsI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709223; c=relaxed/simple; bh=tyTlc47bpEBT+9eQ9/vAAVVsht9NlbbJFRx/wkcmz58=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-type; b=bIFQAzeaC2/f8M6OlqF40usiOnPn4y7NArOX9YRlOYj4TMh0depkJTEW3f6KsynlhEcIOcjPsGYN3d+e35ZabP+9HXnE45vhZWLBWb8IQ54znSBLq+Enf0vUwnZ03CQK8ZXWgLuhpTmgpYvk20AoAHEDXSvdfipdUhUTThShTiE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=CI3zTJOK; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="CI3zTJOK" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715709220; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; 
bh=wBO927D575iuHOlHVpwJTixenliXhHTxZ0CblZ6Nq8M=; b=CI3zTJOKZOfqwzWAWkrijm5W7UmWEHUNX3A/EMQeLqPLz1dHRWlu/m+vZqBBLroda8l91J ad6a5xe6YqUDY2NWVXrfUYvOCszC5adfxJ8Ul3OAq8CWbKTOnqxokAmQeItNeGoITozd/a 8jhpGY9mBfogxPcFoEUTZfH1N7ryiBI= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-498-ZMwMrSutM7uMeoQSpaA88g-1; Tue, 14 May 2024 13:53:35 -0400 X-MC-Unique: ZMwMrSutM7uMeoQSpaA88g-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 872A68016FA; Tue, 14 May 2024 17:53:34 +0000 (UTC) Received: from jmeneghi.bos.com (unknown [10.2.17.24]) by smtp.corp.redhat.com (Postfix) with ESMTP id 67CCB400F13; Tue, 14 May 2024 17:53:33 +0000 (UTC) From: John Meneghini To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, emilne@redhat.com, hare@kernel.org Cc: linux-block@vger.kernel.org, cgroups@vger.kernel.org, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, jmeneghi@redhat.com, jrani@purestorage.com, randyj@purestorage.com Subject: [PATCH v4 5/6] nvme: add 'latency' iopolicy Date: Tue, 14 May 2024 13:53:21 -0400 Message-Id: <20240514175322.19073-6-jmeneghi@redhat.com> In-Reply-To: <20240514175322.19073-1-jmeneghi@redhat.com> References: <20240514175322.19073-1-jmeneghi@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.9 Content-Type: text/plain; charset="utf-8" From: Hannes Reinecke Add a latency-based I/O policy for multipathing. 
It uses the blk-nodelat latency tracker to provide latencies for each node, and schedules I/O on the path with the least latency for the submitting node. Signed-off-by: Hannes Reinecke [jmeneghi: fix CONFIG_BLK_NODE_LATENCY n and add latency iopolicy to modinf= o] Signed-off-by: John Meneghini --- drivers/nvme/host/multipath.c | 62 ++++++++++++++++++++++++++++++----- drivers/nvme/host/nvme.h | 1 + 2 files changed, 55 insertions(+), 8 deletions(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 8702a40a1971..e9330bb1990b 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -18,6 +18,7 @@ static const char *nvme_iopolicy_names[] =3D { [NVME_IOPOLICY_NUMA] =3D "numa", [NVME_IOPOLICY_RR] =3D "round-robin", [NVME_IOPOLICY_QD] =3D "queue-depth", + [NVME_IOPOLICY_LAT] =3D "latency", }; =20 static int iopolicy =3D NVME_IOPOLICY_NUMA; @@ -32,6 +33,10 @@ static int nvme_set_iopolicy(const char *val, const stru= ct kernel_param *kp) iopolicy =3D NVME_IOPOLICY_RR; else if (!strncmp(val, "queue-depth", 11)) iopolicy =3D NVME_IOPOLICY_QD; +#ifdef CONFIG_BLK_NODE_LATENCY + else if (!strncmp(val, "latency", 7)) + iopolicy =3D NVME_IOPOLICY_LAT; +#endif else return -EINVAL; =20 @@ -43,10 +48,36 @@ static int nvme_get_iopolicy(char *buf, const struct ke= rnel_param *kp) return sprintf(buf, "%s\n", nvme_iopolicy_names[iopolicy]); } =20 +static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopol= icy) +{ + struct nvme_ns_head *h; + struct nvme_ns *ns; + bool enable =3D iopolicy =3D=3D NVME_IOPOLICY_LAT; + int ret =3D 0; + + mutex_lock(&subsys->lock); + list_for_each_entry(h, &subsys->nsheads, entry) { + list_for_each_entry_rcu(ns, &h->list, siblings) { + if (enable) { + ret =3D blk_nlat_enable(ns->disk); + if (ret) + break; + } else + blk_nlat_disable(ns->disk); + } + } + mutex_unlock(&subsys->lock); + return ret; +} + module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy, &iopolicy, 0644); 
MODULE_PARM_DESC(iopolicy, +#if defined(CONFIG_BLK_NODE_LATENCY) + "Default multipath I/O policy; 'numa' (default) , 'round-robin', 'queue-d= epth' or 'latency'"); +#else "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue= -depth'"); +#endif =20 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys) { @@ -250,13 +281,16 @@ static struct nvme_ns *__nvme_find_path(struct nvme_n= s_head *head, int node) { int found_distance =3D INT_MAX, fallback_distance =3D INT_MAX, distance; struct nvme_ns *found =3D NULL, *fallback =3D NULL, *ns; + int iopolicy =3D READ_ONCE(head->subsys->iopolicy); =20 list_for_each_entry_rcu(ns, &head->list, siblings) { if (nvme_path_is_disabled(ns)) continue; =20 - if (READ_ONCE(head->subsys->iopolicy) =3D=3D NVME_IOPOLICY_NUMA) + if (iopolicy =3D=3D NVME_IOPOLICY_NUMA) distance =3D node_distance(node, ns->ctrl->numa_node); + else if (iopolicy =3D=3D NVME_IOPOLICY_LAT) + distance =3D blk_nlat_latency(ns->disk, node); else distance =3D LOCAL_DISTANCE; =20 @@ -380,8 +414,8 @@ static inline bool nvme_path_is_optimized(struct nvme_n= s *ns) =20 inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) { - int iopolicy =3D READ_ONCE(head->subsys->iopolicy); int node; + int iopolicy =3D READ_ONCE(head->subsys->iopolicy); struct nvme_ns *ns; =20 /* @@ -400,8 +434,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_he= ad *head) =20 if (iopolicy =3D=3D NVME_IOPOLICY_RR) return nvme_round_robin_path(head, node, ns); - - if (unlikely(!nvme_path_is_optimized(ns))) + if (iopolicy =3D=3D NVME_IOPOLICY_LAT || + unlikely(!nvme_path_is_optimized(ns))) return __nvme_find_path(head, node); return ns; } @@ -871,15 +905,18 @@ static ssize_t nvme_subsys_iopolicy_store(struct devi= ce *dev, { struct nvme_subsystem *subsys =3D container_of(dev, struct nvme_subsystem, dev); - int i; + int i, ret; =20 for (i =3D 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) { if (sysfs_streq(buf, nvme_iopolicy_names[i])) { - 
nvme_subsys_iopolicy_update(subsys, i); - return count; + ret =3D nvme_activate_iopolicy(subsys, i); + if (!ret) { + nvme_subsys_iopolicy_update(subsys, i); + return count; + } + return ret; } } - return -EINVAL; } SUBSYS_ATTR_RW(iopolicy, S_IRUGO | S_IWUSR, @@ -915,6 +952,15 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl= *ctrl, =20 void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid) { + if (!blk_nlat_init(ns->disk) && + READ_ONCE(ns->head->subsys->iopolicy) =3D=3D NVME_IOPOLICY_LAT) { + int ret =3D blk_nlat_enable(ns->disk); + + if (unlikely(ret)) + pr_warn("%s: Failed to enable latency tracking, error %d\n", + ns->disk->disk_name, ret); + } + if (nvme_ctrl_use_ana(ns->ctrl)) { struct nvme_ana_group_desc desc =3D { .grpid =3D anagrpid, diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 4e876524726a..56b78f21406a 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -406,6 +406,7 @@ enum nvme_iopolicy { NVME_IOPOLICY_NUMA, NVME_IOPOLICY_RR, NVME_IOPOLICY_QD, + NVME_IOPOLICY_LAT, }; =20 struct nvme_subsystem { --=20 2.39.3 From nobody Fri Dec 19 19:09:35 2025 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EB022181307 for ; Tue, 14 May 2024 17:53:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709222; cv=none; b=eQukMzuuERbRtHzGLIG2B/1pWZiNJSRqASIIe+0qtDa7tYLwLfUr5aYNRqtKAt8LDXPKBdg2IeLjBv+YLZdl/ZGu1Dz23xQTdpEmCOpV4a2aMn49zlj6FfB8YGXA/fQawhatxDJRTYkAhBtbgO0DgpAFtEi1HS8v4reiCUHDHxs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715709222; c=relaxed/simple; bh=29Ogzu8HUgDu3PF9FKvjcGb+SavKhmrv7g+xDkNBUkY=; 
h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-type; b=n+u4mqnMNyirs9t+nDisLkUvw8aHRJNAQ/ec2ZQFGGbGjk9CQX30eyS4/ndMHLDU7eeP9D/AfSRolKdCdBg6feSBwwWnObGJG8LsVWKzQOoVfrwzhdDzTY5DKtNzNkYlpKI7ObbrjOurkAZLlNp8qvTTT90EGiqqUhMmRk5CsyU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=H+QZPFSo; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="H+QZPFSo" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715709220; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=J8WReQiv4GBqlDuegj32JcgtpHrEsWG68ox5pprOL3Y=; b=H+QZPFSohMIki4oHu8rhha/RnYOE0Ow+Fe08dLlpFN2Ya/IJErvO8Tzvnp1fFOav4zWgro sIbKKe2JnToOA24vLAzEflM1G1xmDUXH6yE5+2rQ6QhDvm2fl1YUg3BIvC7WiNeLNiya6u K4ZHDpKwdl23egNhrinoC2SP0ZwToOg= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-494-qrfXvdWFNuuPCZl6fv1CTg-1; Tue, 14 May 2024 13:53:36 -0400 X-MC-Unique: qrfXvdWFNuuPCZl6fv1CTg-1 Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com [10.11.54.9]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com 
(Postfix) with ESMTPS id E2D1229AA2C9; Tue, 14 May 2024 17:53:35 +0000 (UTC) Received: from jmeneghi.bos.com (unknown [10.2.17.24]) by smtp.corp.redhat.com (Postfix) with ESMTP id A433A400057; Tue, 14 May 2024 17:53:34 +0000 (UTC) From: John Meneghini To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, emilne@redhat.com, hare@kernel.org Cc: linux-block@vger.kernel.org, cgroups@vger.kernel.org, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org, jmeneghi@redhat.com, jrani@purestorage.com, randyj@purestorage.com Subject: [PATCH v4 6/6] nvme: multipath: pr_notice when iopolicy changes Date: Tue, 14 May 2024 13:53:22 -0400 Message-Id: <20240514175322.19073-7-jmeneghi@redhat.com> In-Reply-To: <20240514175322.19073-1-jmeneghi@redhat.com> References: <20240514175322.19073-1-jmeneghi@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.9 Content-Type: text/plain; charset="utf-8" Send a pr_notice whenever the iopolicy on a subsystem is changed. This is important for support reasons: users are fully expected to change the iopolicy while I/O is in progress.
Signed-off-by: John Meneghini --- drivers/nvme/host/multipath.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index e9330bb1990b..0286e44a081f 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -67,6 +67,10 @@ static int nvme_activate_iopolicy(struct nvme_subsystem = *subsys, int iopolicy) } } mutex_unlock(&subsys->lock); + + pr_notice("%s: %s enable %d status %d for subsysnqn %s\n", __func__, + nvme_iopolicy_names[iopolicy], enable, ret, subsys->subnqn); + return ret; } =20 @@ -890,6 +894,8 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem = *subsys, int iopolicy) { struct nvme_ctrl *ctrl; =20 + int old_iopolicy =3D READ_ONCE(subsys->iopolicy); + WRITE_ONCE(subsys->iopolicy, iopolicy); =20 mutex_lock(&nvme_subsystems_lock); @@ -898,6 +904,10 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem= *subsys, int iopolicy) nvme_mpath_clear_ctrl_paths(ctrl); } mutex_unlock(&nvme_subsystems_lock); + + pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__, + nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy], + subsys->subnqn); } =20 static ssize_t nvme_subsys_iopolicy_store(struct device *dev, --=20 2.39.3
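For readers following the series, the per-node update done in nlat_timer_fn() and the "pick the least latency" step of the 'latency' iopolicy can be sketched in userspace C. This is only an illustration under stated assumptions, not code from the patches: struct node_window, nlat_update() and pick_lowest_latency() are hypothetical names, and the sketch ignores ANA state, disabled paths, locking and the blk_stat timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one per-node blk_stat bucket. */
struct node_window {
	uint64_t mean_nsec;   /* mean completion latency in this window */
	unsigned int samples; /* completions observed in this window */
};

/*
 * Mirrors the per-bucket rule in nlat_timer_fn(): with fewer than
 * 'lowat' samples the previous estimate is scaled to decay_pct percent;
 * otherwise the estimate is replaced by the window mean.
 */
static uint64_t nlat_update(uint64_t prev_nsec, struct node_window w,
			    unsigned int lowat, unsigned int decay_pct)
{
	assert(decay_pct <= 100); /* the debugfs write path rejects > 100 */
	if (w.samples < lowat)
		return prev_nsec * decay_pct / 100;
	return w.mean_nsec;
}

/*
 * Return the index of the path with the smallest latency estimate,
 * or -1 if there are no paths. Ties go to the lowest index.
 * (Assumption: all paths are usable; the real code also checks
 * path state before comparing.)
 */
static int pick_lowest_latency(const uint64_t *lat_usec, size_t npaths)
{
	int best = -1;
	size_t i;

	for (i = 0; i < npaths; i++)
		if (best < 0 || lat_usec[i] < lat_usec[best])
			best = (int)i;
	return best;
}
```

With the defaults from the patch (lowat 2, decay 50), a node that saw fewer than two completions in a window has its estimate halved, so an idle path gradually looks cheaper and gets tried again, which appears to be the intent of the 'decay' comment in nlat_timer_fn().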