From nobody Tue Jun 16 11:17:57 2026
From: Aaron Tomlin
To: axboe@kernel.dk, rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com
Cc: bvanassche@acm.org, johannes.thumshirn@wdc.com, kch@nvidia.com,
	dlemoal@kernel.org, ritesh.list@gmail.com, loberman@redhat.com,
	neelx@suse.com, sean@ashe.io, mproche@gmail.com, chjohnst@gmail.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
Subject: [PATCH v4 1/2] blk-mq: add tracepoint block_rq_tag_wait
Date: Sat, 18 Apr 2026 22:30:35 -0400
Message-ID: <20260419023036.1419514-2-atomlin@atomlin.com>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20260419023036.1419514-1-atomlin@atomlin.com>
References: <20260419023036.1419514-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) sharing the same
blk_mq_tag_set are starved of hardware tags.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptibly via io_schedule().
While this can be inferred via sched:sched_switch or dynamically traced
by attaching a kprobe to blk_mq_mark_tag_wait(), there is no dedicated,
out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait trace point in the tag
allocation slow-path. It triggers immediately before the thread yields
the CPU, exposing the exact hardware context (hctx) that is starved, the
specific pool experiencing starvation (hardware or software scheduler),
and the total pool depth.

This provides storage engineers and performance monitoring agents with a
zero-configuration, low-overhead mechanism to definitively identify
shared-tag bottlenecks and tune I/O schedulers or cgroup throttling
accordingly.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Damien Le Moal
Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Laurence Oberman
Tested-by: Laurence Oberman
Signed-off-by: Aaron Tomlin
---
 block/blk-mq-tag.c           |  4 ++++
 include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..66138dd043d4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include
 
 #include
+#include
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		trace_block_rq_tag_wait(data->q, data->hctx,
+					data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..71554b94e4d0 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver tags
+ * or software scheduler tags). This trace point indicates that the context
+ * will be placed into an uninterruptible state via io_schedule() until an
+ * active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
+
+	TP_ARGS(q, hctx, is_sched_tag),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev		)
+		__field( u32,		hctx_id		)
+		__field( u32,		nr_tags		)
+		__field( bool,		is_sched_tag	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= disk_devt(q->disk);
+		__entry->hctx_id	= hctx->queue_num;
+		__entry->is_sched_tag	= is_sched_tag;
+
+		if (is_sched_tag)
+			__entry->nr_tags = hctx->sched_tags->nr_tags;
+		else
+			__entry->nr_tags = hctx->tags->nr_tags;
+	),
+
+	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id,
+		  __entry->is_sched_tag ? "scheduler" : "hardware",
+		  __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request
-- 
2.51.0

From nobody Tue Jun 16 11:17:57 2026
From: Aaron Tomlin
To: axboe@kernel.dk, rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com
Cc: bvanassche@acm.org, johannes.thumshirn@wdc.com, kch@nvidia.com,
	dlemoal@kernel.org, ritesh.list@gmail.com, loberman@redhat.com,
	neelx@suse.com, sean@ashe.io, mproche@gmail.com, chjohnst@gmail.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org
Subject: [PATCH v4 2/2] blk-mq: expose tag starvation counts via debugfs
Date: Sat, 18 Apr 2026 22:30:36 -0400
Message-ID: <20260419023036.1419514-3-atomlin@atomlin.com>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20260419023036.1419514-1-atomlin@atomlin.com>
References: <20260419023036.1419514-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices are starved of available
tags.

This patch introduces two new debugfs attributes for each block hardware
queue:
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag

These files expose cumulative counters that increment each time a
submitting context is forced into an uninterruptible sleep via
io_schedule() due to the complete exhaustion of physical driver tags or
software scheduler tags, respectively.

To ensure negligible performance overhead even in production
environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
tracking logic utilises dynamically allocated per-CPU counters. When
this configuration is disabled, the tracking logic compiles down to a
safe no-op.
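
For illustration only, the per-CPU counting scheme described above can be
modelled in userspace C. This is a sketch under stated assumptions, not
the kernel code: NR_SLOTS, struct wait_counters, counters_init(),
counters_inc(), counters_sum() and counters_free() are hypothetical
names, a calloc()'d array stands in for alloc_percpu(), and an explicit
slot index stands in for the CPU the submitter is running on. It shows
the two properties the patch relies on: a writer only touches its own
slot, and a reader sums every slot, tolerating a counter that was never
allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of possible CPUs */

/* Userspace model of the two per-CPU counters added to struct blk_mq_hw_ctx. */
struct wait_counters {
	unsigned long *hw;	/* models hctx->wait_on_hw_tag */
	unsigned long *sched;	/* models hctx->wait_on_sched_tag */
};

/* Release both arrays and reset the pointers, as the unregister path does. */
void counters_free(struct wait_counters *c)
{
	free(c->hw);
	c->hw = NULL;
	free(c->sched);
	c->sched = NULL;
}

/* Allocate zeroed slots; returns 0 on success, -1 on allocation failure. */
int counters_init(struct wait_counters *c)
{
	c->hw = calloc(NR_SLOTS, sizeof(*c->hw));
	c->sched = calloc(NR_SLOTS, sizeof(*c->sched));
	if (!c->hw || !c->sched) {
		counters_free(c);
		return -1;
	}
	return 0;
}

/* Bump only the caller's own slot; a missing array is silently skipped. */
void counters_inc(struct wait_counters *c, unsigned int slot, bool is_sched)
{
	unsigned long *tags = is_sched ? c->sched : c->hw;

	assert(slot < NR_SLOTS);
	if (tags)
		tags[slot]++;
}

/* Sum every slot, which is what a reader of the debugfs file would see. */
unsigned long counters_sum(const unsigned long *tags)
{
	unsigned long count = 0;
	unsigned int slot;

	if (tags) {
		for (slot = 0; slot < NR_SLOTS; slot++)
			count += tags[slot];
	}
	return count;
}
```

In this model the increment path never takes a lock or touches a shared
cache line, while the comparatively rare read pays the cost of walking
all slots; the sum is a snapshot rather than an exact instantaneous
value.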

Signed-off-by: Aaron Tomlin
---
 block/blk-mq-debugfs.c | 84 ++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h |  7 ++++
 block/blk-mq-tag.c     |  4 ++
 include/linux/blk-mq.h | 12 ++++++
 4 files changed, 107 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..a3effed55d90 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -484,6 +485,54 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+/**
+ * hctx_wait_on_hw_tag_show - display hardware tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of physical hardware driver tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_hw_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_hw_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
+/**
+ * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of software scheduler tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_sched_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_sched_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
 #define CTX_RQ_SEQ_OPS(name, type)					\
 static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
 	__acquires(&ctx->lock)						\
@@ -599,6 +648,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"active", 0400, hctx_active_show},
 	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
 	{"type", 0400, hctx_type_show},
+	{"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
+	{"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
 	{},
 };
 
@@ -670,6 +721,11 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
 	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
 
+	if (!hctx->wait_on_hw_tag)
+		hctx->wait_on_hw_tag = alloc_percpu(unsigned long);
+	if (!hctx->wait_on_sched_tag)
+		hctx->wait_on_sched_tag = alloc_percpu(unsigned long);
+
 	debugfs_create_files(q, hctx->debugfs_dir, hctx,
 			     blk_mq_debugfs_hctx_attrs);
 
@@ -684,6 +740,11 @@ void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx)
 	debugfs_remove_recursive(hctx->debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 	hctx->debugfs_dir = NULL;
+
+	free_percpu(hctx->wait_on_hw_tag);
+	hctx->wait_on_hw_tag = NULL;
+	free_percpu(hctx->wait_on_sched_tag);
+	hctx->wait_on_sched_tag = NULL;
 }
 
 void blk_mq_debugfs_register_hctxs(struct request_queue *q)
@@ -815,3 +876,26 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
 	debugfs_remove_recursive(hctx->sched_debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 }
+
+/**
+ * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
+ * @hctx: hardware context associated with the tag allocation
+ * @is_sched: true if the starved pool is the software scheduler
+ *
+ * Evaluates the exhausted tag pool and safely increments the appropriate
+ * per-cpu debugfs starvation counter.
+ *
+ * Note: A race window exists during rapid device probe or CPU hotplug
+ * where I/O might be submitted before blk_mq_debugfs_register_hctx() has
+ * completed allocating the per-CPU counters. Therefore, the pointer is
+ * explicitly checked to prevent a NULL pointer dereference.
+ */
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched)
+{
+	unsigned long __percpu *tags = is_sched ? hctx->wait_on_sched_tag :
+						  hctx->wait_on_hw_tag;
+
+	if (likely(tags))
+		this_cpu_inc(*tags);
+}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 49bb1aaa83dc..a0094d004d08 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -17,6 +17,8 @@ struct blk_mq_debugfs_attr {
 	const struct seq_operations *seq_ops;
 };
 
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched);
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
 int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
 
@@ -35,6 +37,11 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 #else
+static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+						bool is_sched)
+{
+}
+
 static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 66138dd043d4..3cc6a97a87a0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -17,6 +17,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
 
 /*
  * Recalculate wakeup batch when tag is shared by hctx.
@@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		trace_block_rq_tag_wait(data->q, data->hctx,
 					data->rq_flags & RQF_SCHED_TAGS);
 
+		blk_mq_debugfs_inc_wait_tags(data->hctx,
+					     data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ebc45557aee8..17cd6221bb93 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -453,6 +453,18 @@ struct blk_mq_hw_ctx {
 	struct dentry		*debugfs_dir;
 	/** @sched_debugfs_dir:	debugfs directory for the scheduler. */
 	struct dentry		*sched_debugfs_dir;
+	/**
+	 * @wait_on_hw_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to physical
+	 * hardware tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_hw_tag;
+	/**
+	 * @wait_on_sched_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to software
+	 * scheduler tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_sched_tag;
 #endif
 
 	/**
-- 
2.51.0
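
As an illustration of how a monitoring agent might consume the
block_rq_tag_wait event from patch 1/2, the following userspace C sketch
parses the event payload produced by the TP_printk() format
"%d,%d hctx=%u starved on %s tags (depth=%u)". The names
struct tag_wait_event and parse_tag_wait() are hypothetical and not part
of the series, and the sketch assumes the caller has already stripped
whatever prefix (task, CPU, timestamp, event name) the trace buffer puts
in front of the payload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Decoded fields of one block_rq_tag_wait payload (hypothetical type). */
struct tag_wait_event {
	unsigned int dev_major;
	unsigned int dev_minor;
	unsigned int hctx_id;
	unsigned int depth;	/* nr_tags of the starved pool */
	bool is_sched_tag;	/* true: scheduler tags, false: hardware tags */
};

/*
 * Parse a payload such as "8,0 hctx=3 starved on hardware tags (depth=64)".
 * Returns 0 on success and -1 if the text does not match the format.
 */
int parse_tag_wait(const char *payload, struct tag_wait_event *ev)
{
	char pool[16];
	int n;

	assert(payload && ev);

	/* %15s leaves room for the terminating NUL in pool[16]. */
	n = sscanf(payload, "%u,%u hctx=%u starved on %15s tags (depth=%u)",
		   &ev->dev_major, &ev->dev_minor, &ev->hctx_id, pool,
		   &ev->depth);
	if (n != 5)
		return -1;

	if (strcmp(pool, "scheduler") == 0)
		ev->is_sched_tag = true;
	else if (strcmp(pool, "hardware") == 0)
		ev->is_sched_tag = false;
	else
		return -1;

	return 0;
}
```

A caller could, for example, aggregate events by (dev_major, dev_minor,
hctx_id) to see which hardware queue is starved most often, and compare
the reported depth against the configured queue depth.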