From: Jialin Wang
To: wjl.linux@gmail.com
Cc: axboe@kernel.dk, cgroups@vger.kernel.org, josef@toxicpanda.com,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, tj@kernel.org
Subject: [PATCH v3] blk-iocost: fix busy_level reset when no IOs complete
Date: Tue, 31 Mar 2026 10:05:09 +0000
Message-ID: <20260331100509.182882-1-wjl.linux@gmail.com>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260329154112.526679-1-wjl.linux@gmail.com>
References: <20260329154112.526679-1-wjl.linux@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

When a disk is saturated, it is common for no IOs to complete within a
timer period. In that case, rq_wait_pct and missed_ppm are currently
calculated as 0, so iocost incorrectly interprets the period as meeting
the QoS targets and resets busy_level to 0.

This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.

Fix this by tracking the number of IO completions (nr_done) in a
period. If nr_done is 0 and there are lagging IOs, the saturation
status is unknown, so we keep busy_level unchanged.

The issue is consistently reproducible on Azure Standard_D8as_v5
(Dasv5) VMs with 512GB Premium SSD (P20) using the script below.
It was not observed on GCP n2d VMs (with 100G pd-ssd and 1.5T
local-ssd), and no regressions were found with this patch.

In this script, cgA performs large IOs with iodepth=128, while cgB
performs small IOs with iodepth=1 rate_iops=100 rw=randrw. With iocost
enabled, we expect it to throttle cgA: the submission latency (slat) of
cgA should be significantly higher, while cgB should reach 200 IOPS
with a low completion latency (clat).

  BLK_DEVID="8:0"
  MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
  QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"
  echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
  echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos

  CG_A="/sys/fs/cgroup/cgA"
  CG_B="/sys/fs/cgroup/cgB"

  FILE_A="/path/to/sda/A.fio.testfile"
  FILE_B="/path/to/sda/B.fio.testfile"
  RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"

  mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"

  get_result() {
    local file=$1
    local label=$2

    local results=$(jq -r '
    .jobs[0].mixed |
    ( .iops | tonumber | round ) as $iops |
    ( .bw_bytes / 1024 / 1024 ) as $bps |
    ( .slat_ns.mean / 1000000 ) as $slat |
    ( .clat_ns.mean / 1000000 ) as $avg |
    ( .clat_ns.max / 1000000 ) as $max |
    ( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
    ( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
    ( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
    ( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
    "\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
    ' "$file")

    IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
    printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
      "$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
  }

  run_fio() {
    local cg_path=$1
    local filename=$2
    local name=$3
    local bs=$4
    local qd=$5
    local out=$6
    shift 6
    local extra=$@

    (
      pid=$(sh -c 'echo $PPID')
      echo $pid >"${cg_path}/cgroup.procs"
      fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
        --ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
        --time_based --group_reporting --unified_rw_reporting=mixed \
        --output-format=json --output="$out" $extra >/dev/null 2>&1
    ) &
  }

  echo "Starting Test ..."

  for bs_b in "4k" "32k" "256k"; do
    echo "Running iteration: BS=$bs_b"

    out_a="${RESULT_DIR}/cgA_1m.json"
    out_b="${RESULT_DIR}/cgB_${bs_b}.json"

    # cgA: Heavy background (BS 1MB, QD 128)
    run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"
    # cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
    run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"

    wait
    SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
    SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
  done

  echo -e "\nFinal Results Summary:\n"

  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
    "" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
  printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
    "CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
  echo "$SUMMARY_DATA"

  echo "Results saved in $RESULT_DIR"

Before:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   166    166.37  3.44     748.95   1298.29  977.27   1233.13  1300.23  1300.23
  cgB-4k   5      0.02    0.02     181.74   761.32   742.39   759.17   759.17   759.17

  cgA-1m   167    166.51  1.98     748.68   1549.41  809.50   1451.23  1551.89  1551.89
  cgB-32k  6      0.18    0.02     169.98   761.76   742.39   759.17   759.17   759.17

  cgA-1m   166    165.55  2.89     750.89   1540.37  851.44   1451.23  1535.12  1535.12
  cgB-256k 5      1.30    0.02     191.35   759.51   750.78   759.17   759.17   759.17

After:
                          slat     clat     clat     clat     clat     clat     clat
  CGROUP   IOPS   MB/s    avg(ms)  avg(ms)  max(ms)  P90(ms)  P99      P99.9    P99.99

  cgA-1m   162    162.48  6.14     749.69   850.02   826.28   834.67   843.06   851.44
  cgB-4k   199    0.78    0.01     1.95     42.12    2.57     7.50     34.87    42.21

  cgA-1m   146    146.20  6.83     833.04   908.68   893.39   901.78   910.16   910.16
  cgB-32k  200    6.25    0.01     2.32     31.40    3.06     7.50     16.58    31.33

  cgA-1m   110    110.46  9.04     1082.67  1197.91  1182.79  1199.57  1199.57  1199.57
  cgB-256k 200    49.98   0.02     3.69     22.20    4.88     9.11     20.05    22.15

Signed-off-by: Jialin Wang
Acked-by: Tejun Heo
---
Changes in v3:
- Handle only the !nr_done && nr_lagging case and leave the other cases
  as they are.

Changes in v2:
- Handle more edge cases to prevent potential regressions.

v2: https://lore.kernel.org/all/20260329154112.526679-1-wjl.linux@gmail.com/
v1: https://lore.kernel.org/all/20260318163351.394528-1-wjl.linux@gmail.com/

 block/blk-iocost.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index d145db61e5c3..0cca88a366dc 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1596,7 +1596,8 @@ static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p)
+static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p,
+			 u32 *nr_done)
 {
 	u32 nr_met[2] = { };
 	u32 nr_missed[2] = { };
@@ -1633,6 +1634,8 @@ static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p
 
 	*rq_wait_pct_p = div64_u64(rq_wait_ns * 100,
 				   ioc->period_us * NSEC_PER_USEC);
+
+	*nr_done = nr_met[READ] + nr_met[WRITE] + nr_missed[READ] + nr_missed[WRITE];
 }
 
 /* was iocg idle this period? */
@@ -2250,12 +2253,12 @@ static void ioc_timer_fn(struct timer_list *timer)
 	u64 usage_us_sum = 0;
 	u32 ppm_rthr;
 	u32 ppm_wthr;
-	u32 missed_ppm[2], rq_wait_pct;
+	u32 missed_ppm[2], rq_wait_pct, nr_done;
 	u64 period_vtime;
 	int prev_busy_level;
 
 	/* how were the latencies during the period? */
-	ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct);
+	ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct, &nr_done);
 
 	/* take care of active iocgs */
 	spin_lock_irq(&ioc->lock);
@@ -2397,9 +2400,17 @@ static void ioc_timer_fn(struct timer_list *timer)
 	 * and should increase vtime rate.
 	 */
 	prev_busy_level = ioc->busy_level;
-	if (rq_wait_pct > RQ_WAIT_BUSY_PCT ||
-	    missed_ppm[READ] > ppm_rthr ||
-	    missed_ppm[WRITE] > ppm_wthr) {
+	if (!nr_done && nr_lagging) {
+		/*
+		 * When there are lagging IOs but no completions, we don't
+		 * know if the IO latency will meet the QoS targets. The
+		 * disk might be saturated or not. We should not reset
+		 * busy_level to 0 (which would prevent vrate from scaling
+		 * up or down), but rather to keep it unchanged.
+		 */
+	} else if (rq_wait_pct > RQ_WAIT_BUSY_PCT ||
+		   missed_ppm[READ] > ppm_rthr ||
+		   missed_ppm[WRITE] > ppm_wthr) {
 		/* clearly missing QoS targets, slow down vrate */
 		ioc->busy_level = max(ioc->busy_level, 0);
 		ioc->busy_level++;
-- 
2.53.0