From: Diangang Li
To: axboe@kernel.dk
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, changfengnan@bytedance.com, Diangang Li
Subject: [RFC 1/1] block: export windowed IO P99 latency
Date: Fri, 9 Jan 2026 16:31:26 +0800
Message-Id: <20260109083126.15052-2-lidiangang@bytedance.com>
In-Reply-To: <20260109083126.15052-1-lidiangang@bytedance.com>
References: <20260109083126.15052-1-lidiangang@bytedance.com>

Track per-IO completion latency in a power-of-two histogram
(NR_STAT_BUCKETS buckets, DISK_LAT_BASE_USEC .. DISK_LAT_MAX_USEC).
Maintain a per-cpu sliced ring histogram and compute P99 by aggregating
the recent slices at read time in /proc/diskstats and
/sys/block/<dev>/stat. Report P99 in usecs using the bucket midpoint,
clamp overflows to DISK_LAT_MAX_USEC, and append the P99 for
read/write/discard/flush.
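For illustration only, the bucketing described above can be sketched in
userspace C. The constants are copied from this patch; fls64_sketch(),
latency_bucket() and bucket_report_us() are hypothetical names for this
sketch, with fls64_sketch() standing in for the kernel's fls64(). It is
not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from the patch. */
#define NR_STAT_BUCKETS    21
#define DISK_LAT_BASE_USEC 8U
#define DISK_LAT_MAX_USEC  (DISK_LAT_BASE_USEC << (NR_STAT_BUCKETS - 1))

/*
 * Hypothetical stand-in for the kernel's fls64(): 1-based index of the
 * highest set bit, 0 when no bit is set.
 */
static unsigned int fls64_sketch(uint64_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Map a latency in nanoseconds to a histogram bucket index. */
static unsigned int latency_bucket(uint64_t latency_ns)
{
	uint64_t us = latency_ns / 1000;
	unsigned int b;

	if (us <= DISK_LAT_BASE_USEC)
		return 0;
	if (us >= DISK_LAT_MAX_USEC)
		return NR_STAT_BUCKETS - 1;

	b = fls64_sketch((us - 1) / DISK_LAT_BASE_USEC);
	return b < NR_STAT_BUCKETS - 1 ? b : NR_STAT_BUCKETS - 1;
}

/* Value reported for a bucket: midpoint of (upper / 2, upper], or the cap. */
static uint32_t bucket_report_us(unsigned int bucket)
{
	uint32_t low;

	assert(bucket < NR_STAT_BUCKETS);
	if (bucket == NR_STAT_BUCKETS - 1)
		return DISK_LAT_MAX_USEC;

	low = (DISK_LAT_BASE_USEC << bucket) >> 1;
	return low + (low >> 1);
}
```

With these definitions a 100 usec completion lands in bucket 4 (upper
bound 128 usec) and would be reported as 96 usec, so the exported value
is only accurate to within the width of its bucket.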

Suggested-by: Fengnan Chang
Signed-off-by: Diangang Li
---
 block/blk-core.c          |  5 ++-
 block/blk-flush.c         |  6 ++-
 block/blk-mq.c            |  5 ++-
 block/genhd.c             | 50 ++++++++++++++++++++++++-
 include/linux/part_stat.h | 79 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 139 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8387fe50ea156..832ba4fc1b75a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1062,12 +1062,15 @@ void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
 	const int sgrp = op_stat_group(op);
 	unsigned long now = READ_ONCE(jiffies);
 	unsigned long duration = now - start_time;
+	u64 latency_ns = jiffies_to_nsecs(duration);
+	unsigned int bucket = diskstat_latency_bucket(latency_ns);
 
 	part_stat_lock();
 	update_io_ticks(bdev, now, true);
 	part_stat_inc(bdev, ios[sgrp]);
 	part_stat_add(bdev, sectors[sgrp], sectors);
-	part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration));
+	part_stat_add(bdev, nsecs[sgrp], latency_ns);
+	part_stat_latency_record(bdev, sgrp, now, bucket);
 	part_stat_local_dec(bdev, in_flight[op_is_write(op)]);
 	part_stat_unlock();
 }
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 43d6152897a42..b3ff78025968f 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -124,11 +124,13 @@ static void blk_flush_restore_request(struct request *rq)
 static void blk_account_io_flush(struct request *rq)
 {
 	struct block_device *part = rq->q->disk->part0;
+	u64 latency_ns = blk_time_get_ns() - rq->start_time_ns;
+	unsigned int bucket = diskstat_latency_bucket(latency_ns);
 
 	part_stat_lock();
 	part_stat_inc(part, ios[STAT_FLUSH]);
-	part_stat_add(part, nsecs[STAT_FLUSH],
-		      blk_time_get_ns() - rq->start_time_ns);
+	part_stat_add(part, nsecs[STAT_FLUSH], latency_ns);
+	part_stat_latency_record(part, STAT_FLUSH, jiffies, bucket);
 	part_stat_unlock();
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index eff4f72ce83be..6a7fd6681902e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1068,11 +1068,14 @@ static inline void blk_account_io_done(struct request *req, u64 now)
 	 */
 	if ((req->rq_flags & (RQF_IO_STAT|RQF_FLUSH_SEQ)) == RQF_IO_STAT) {
 		const int sgrp = op_stat_group(req_op(req));
+		u64 latency_ns = now - req->start_time_ns;
+		unsigned int bucket = diskstat_latency_bucket(latency_ns);
 
 		part_stat_lock();
 		update_io_ticks(req->part, jiffies, true);
 		part_stat_inc(req->part, ios[sgrp]);
-		part_stat_add(req->part, nsecs[sgrp], now - req->start_time_ns);
+		part_stat_add(req->part, nsecs[sgrp], latency_ns);
+		part_stat_latency_record(req->part, sgrp, jiffies, bucket);
 		part_stat_local_dec(req->part,
 				    in_flight[op_is_write(req_op(req))]);
 		part_stat_unlock();
diff --git a/block/genhd.c b/block/genhd.c
index 69c75117ba2c0..56151c7880651 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -108,23 +108,60 @@ static void part_stat_read_all(struct block_device *part,
 		struct disk_stats *stat)
 {
 	int cpu;
+	u32 now_epoch = (u32)(jiffies / HZ);
 
 	memset(stat, 0, sizeof(struct disk_stats));
 	for_each_possible_cpu(cpu) {
 		struct disk_stats *ptr = per_cpu_ptr(part->bd_stats, cpu);
 		int group;
+		int slice;
+		int bucket;
 
 		for (group = 0; group < NR_STAT_GROUPS; group++) {
 			stat->nsecs[group] += ptr->nsecs[group];
 			stat->sectors[group] += ptr->sectors[group];
 			stat->ios[group] += ptr->ios[group];
 			stat->merges[group] += ptr->merges[group];
+
+			for (slice = 0; slice < NR_STAT_SLICES; slice++) {
+				u32 slice_epoch = READ_ONCE(ptr->latency_epoch[slice]);
+				s32 age = (s32)(now_epoch - slice_epoch);
+
+				if (age < 0 || age >= NR_STAT_SLICES)
+					continue;
+
+				for (bucket = 0; bucket < NR_STAT_BUCKETS; bucket++)
+					stat->latency[group][0][bucket] +=
+						ptr->latency[group][slice][bucket];
+			}
 		}
 
 		stat->io_ticks += ptr->io_ticks;
 	}
 }
 
+static u32 diskstat_p99_us(u32 buckets[NR_STAT_BUCKETS])
+{
+	u32 total = 0;
+	u32 accum = 0;
+	u32 target;
+	int bucket;
+
+	for (bucket = 0; bucket < NR_STAT_BUCKETS; bucket++)
+		total += buckets[bucket];
+	if (!total)
+		return 0;
+
+	target = total - div_u64((u64)total, 100);
+	for (bucket = 0; bucket < NR_STAT_BUCKETS; bucket++) {
+		accum += buckets[bucket];
+		if (accum >= target)
+			return diskstat_latency_bucket_us(bucket);
+	}
+
+	return diskstat_latency_bucket_us(NR_STAT_BUCKETS - 1);
+}
+
 static void bdev_count_inflight_rw(struct block_device *part,
 		unsigned int inflight[2], bool mq_driver)
 {
@@ -1078,7 +1115,8 @@ ssize_t part_stat_show(struct device *dev,
 		"%8lu %8lu %8llu %8u "
 		"%8u %8u %8u "
 		"%8lu %8lu %8llu %8u "
-		"%8lu %8u"
+		"%8lu %8u "
+		"%8u %8u %8u %8u"
 		"\n",
 		stat.ios[STAT_READ],
 		stat.merges[STAT_READ],
@@ -1100,7 +1138,11 @@ ssize_t part_stat_show(struct device *dev,
 		(unsigned long long)stat.sectors[STAT_DISCARD],
 		(unsigned int)div_u64(stat.nsecs[STAT_DISCARD], NSEC_PER_MSEC),
 		stat.ios[STAT_FLUSH],
-		(unsigned int)div_u64(stat.nsecs[STAT_FLUSH], NSEC_PER_MSEC));
+		(unsigned int)div_u64(stat.nsecs[STAT_FLUSH], NSEC_PER_MSEC),
+		diskstat_p99_us(stat.latency[STAT_READ][0]),
+		diskstat_p99_us(stat.latency[STAT_WRITE][0]),
+		diskstat_p99_us(stat.latency[STAT_DISCARD][0]),
+		diskstat_p99_us(stat.latency[STAT_FLUSH][0]));
 }
 
 /*
@@ -1406,6 +1448,10 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 		seq_put_decimal_ull(seqf, " ", stat.ios[STAT_FLUSH]);
 		seq_put_decimal_ull(seqf, " ", (unsigned int)div_u64(stat.nsecs[STAT_FLUSH],
 								      NSEC_PER_MSEC));
+		seq_put_decimal_ull(seqf, " ", diskstat_p99_us(stat.latency[STAT_READ][0]));
+		seq_put_decimal_ull(seqf, " ", diskstat_p99_us(stat.latency[STAT_WRITE][0]));
+		seq_put_decimal_ull(seqf, " ", diskstat_p99_us(stat.latency[STAT_DISCARD][0]));
+		seq_put_decimal_ull(seqf, " ", diskstat_p99_us(stat.latency[STAT_FLUSH][0]));
 		seq_putc(seqf, '\n');
 	}
 	rcu_read_unlock();
diff --git a/include/linux/part_stat.h b/include/linux/part_stat.h
index 729415e91215d..cbcb24abac21e 100644
--- a/include/linux/part_stat.h
+++ b/include/linux/part_stat.h
@@ -5,6 +5,19 @@
 #include <linux/blkdev.h>
 #include <asm/local.h>
 
+/*
+ * Diskstats latency histogram:
+ * - Bucket upper bounds are power-of-two in usecs, starting at DISK_LAT_BASE_USEC.
+ * - The last bucket is a saturation bucket for latencies >= DISK_LAT_MAX_USEC.
+ *
+ * Latency is tracked in NR_STAT_SLICES 1-second slices and
+ * summed to compute a NR_STAT_SLICES-second P99 latency.
+ */
+#define NR_STAT_BUCKETS		21
+#define NR_STAT_SLICES		5
+#define DISK_LAT_BASE_USEC	8U
+#define DISK_LAT_MAX_USEC	(DISK_LAT_BASE_USEC << (NR_STAT_BUCKETS - 1))
+
 struct disk_stats {
 	u64 nsecs[NR_STAT_GROUPS];
 	unsigned long sectors[NR_STAT_GROUPS];
@@ -12,6 +25,8 @@ struct disk_stats {
 	unsigned long merges[NR_STAT_GROUPS];
 	unsigned long io_ticks;
 	local_t in_flight[2];
+	u32 latency_epoch[NR_STAT_SLICES];
+	u32 latency[NR_STAT_GROUPS][NR_STAT_SLICES][NR_STAT_BUCKETS];
 };
 
 /*
@@ -81,4 +96,68 @@ static inline void part_stat_set_all(struct block_device *part, int value)
 
 unsigned int bdev_count_inflight(struct block_device *part);
 
+static inline unsigned int diskstat_latency_bucket(u64 latency_ns)
+{
+	u64 latency_us = latency_ns / 1000;
+	u64 scaled;
+
+	if (latency_us <= DISK_LAT_BASE_USEC)
+		return 0;
+
+	if (latency_us >= DISK_LAT_MAX_USEC)
+		return NR_STAT_BUCKETS - 1;
+
+	scaled = div_u64(latency_us - 1, DISK_LAT_BASE_USEC);
+	return min_t(unsigned int, (unsigned int)fls64(scaled),
+		     NR_STAT_BUCKETS - 1);
+}
+
+static inline u32 diskstat_latency_bucket_upper_us(unsigned int bucket)
+{
+	if (bucket >= NR_STAT_BUCKETS - 1)
+		return DISK_LAT_MAX_USEC;
+	return DISK_LAT_BASE_USEC << bucket;
+}
+
+static inline u32 diskstat_latency_bucket_us(unsigned int bucket)
+{
+	u32 high;
+	u32 low;
+
+	if (bucket >= NR_STAT_BUCKETS - 1)
+		return DISK_LAT_MAX_USEC;
+
+	high = diskstat_latency_bucket_upper_us(bucket);
+	low = high >> 1;
+	return low + (low >> 1);
+}
+
+static inline void __part_stat_latency_prepare(struct block_device *part,
+		u32 epoch, unsigned int slice)
+{
+	struct disk_stats *stats = per_cpu_ptr(part->bd_stats, smp_processor_id());
+	int group;
+
+	if (likely(stats->latency_epoch[slice] == epoch))
+		return;
+
+	for (group = 0; group < NR_STAT_GROUPS; group++)
+		memset(stats->latency[group][slice], 0,
+		       sizeof(stats->latency[group][slice]));
+	stats->latency_epoch[slice] = epoch;
+}
+
+static inline void part_stat_latency_record(struct block_device *part,
+		int sgrp, unsigned long now, unsigned int bucket)
+{
+	u32 epoch = now / HZ;
+	unsigned int slice = epoch % NR_STAT_SLICES;
+
+	__part_stat_latency_prepare(part, epoch, slice);
+	if (bdev_is_partition(part))
+		__part_stat_latency_prepare(bdev_whole(part), epoch, slice);
+
+	part_stat_inc(part, latency[sgrp][slice][bucket]);
+}
+
 #endif /* _LINUX_PART_STAT_H */
-- 
2.39.5
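
To make the windowing easier to follow, here is a minimal
single-threaded userspace sketch of the sliced ring and the P99 walk.
It assumes one histogram instead of per-cpu, per-group ones, takes the
epoch (seconds) as a plain argument, and ignores the reader/writer
races the kernel code has to consider. struct lat_window,
window_record() and window_p99_bucket() are hypothetical names used only
here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Constants copied from the patch. */
#define NR_STAT_BUCKETS 21
#define NR_STAT_SLICES  5

/* One ring of 1-second slices; each slice remembers the epoch it holds. */
struct lat_window {
	uint32_t epoch[NR_STAT_SLICES];
	uint32_t count[NR_STAT_SLICES][NR_STAT_BUCKETS];
};

/* Count one completion, recycling the slice if it holds an older epoch. */
static void window_record(struct lat_window *w, uint32_t epoch,
			  unsigned int bucket)
{
	unsigned int slice = epoch % NR_STAT_SLICES;

	assert(bucket < NR_STAT_BUCKETS);
	if (w->epoch[slice] != epoch) {
		memset(w->count[slice], 0, sizeof(w->count[slice]));
		w->epoch[slice] = epoch;
	}
	w->count[slice][bucket]++;
}

/*
 * Sum the slices that are at most NR_STAT_SLICES - 1 seconds old and
 * return the first bucket whose cumulative count reaches
 * total - total / 100. Returns -1 when the window is empty.
 */
static int window_p99_bucket(const struct lat_window *w, uint32_t now_epoch)
{
	uint32_t sum[NR_STAT_BUCKETS] = { 0 };
	uint32_t total = 0, accum = 0, target;
	int s, b;

	for (s = 0; s < NR_STAT_SLICES; s++) {
		int32_t age = (int32_t)(now_epoch - w->epoch[s]);

		if (age < 0 || age >= NR_STAT_SLICES)
			continue;	/* stale or future slice */
		for (b = 0; b < NR_STAT_BUCKETS; b++) {
			sum[b] += w->count[s][b];
			total += w->count[s][b];
		}
	}
	if (!total)
		return -1;

	target = total - total / 100;
	for (b = 0; b < NR_STAT_BUCKETS; b++) {
		accum += sum[b];
		if (accum >= target)
			return b;
	}
	return NR_STAT_BUCKETS - 1;
}
```

Stale slices are skipped by the reader rather than cleared by it, so
cleanup happens only when a writer next lands on that slice. That
matches the split in the patch between part_stat_read_all() and
__part_stat_latency_prepare().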