From: Sasha Levin
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Yu Kuai, Jens Axboe, Sasha Levin, dlemoal@kernel.org, yang.yang@vivo.com, ming.lei@redhat.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 6.16-6.12] lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
Date: Sun, 10 Aug 2025 12:51:46 -0400
Message-Id: <20250810165158.1888206-4-sashal@kernel.org>
In-Reply-To: <20250810165158.1888206-1-sashal@kernel.org>
References: <20250810165158.1888206-1-sashal@kernel.org>
X-stable: review
X-stable-base: Linux 6.16

From: Yu Kuai

[ Upstream commit 42e6c6ce03fd3e41e39a0f93f9b1a1d9fa664338 ]

Currently elevators will record internal 'async_depth' to throttle
asynchronous requests, and they both calculate shallow_depth based on
sb->shift, with the respect that sb->shift is the available tags in one
word.

However, sb->shift is not the available tags in the last word, see
__map_depth:

if (index == sb->map_nr - 1)
  return sb->depth - (index << sb->shift);

For consequence, if the last word is used, more tags can be get than
expected, for example, assume nr_requests=256 and there are four words,
in the worst case if user set nr_requests=32, then the first word is
the last word, and still use bits per word, which is 64, to calculate
async_depth is wrong.

On the other hand, due to cgroup qos, bfq can allow only one request
to be allocated, and set shallow_depth=1 will still allow the number
of words request to be allocated.

Fix these problems by using shallow_depth to the whole sbitmap instead
of per word, also change kyber, mq-deadline and bfq to follow this, a new
helper __map_depth_with_shallow() is introduced to calculate available
bits in each word.

Signed-off-by: Yu Kuai
Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
---

LLM Generated explanations, may be completely bogus:

**Backport Status: YES**

This commit should be backported to stable kernel trees for the following
reasons:

## Real Bug Fixes Identified

1. **Tag Allocation Overflow Bug**: The commit fixes a serious bug where
   more tags can be allocated than expected. As stated in the commit
   message: "if the last word is used, more tags can be get than
   expected". For example, with nr_requests=256 and four words, if
   nr_requests is set to 32, the calculation incorrectly uses 64 bits
   (full word size) instead of the correct limited depth.

2. **BFQ QoS Violation**: The bug specifically breaks BFQ's
   quality-of-service guarantees. When BFQ tries to limit a cgroup to only
   1 request (shallow_depth=1), the bug allows "the number of words request
   to be allocated" instead, completely violating the intended throttling
   mechanism.

## Impact on Users

The bugs affect critical I/O scheduling functionality:
- **Performance Issues**: Async I/O can starve sync I/O by consuming more
  tags than intended
- **Cgroup QoS Failure**: Cgroup-based I/O throttling doesn't work
  correctly, allowing one cgroup to starve others
- **Incorrect Resource Allocation**: The system allocates more resources
  than configured limits

## Code Analysis Shows Clear Fixes

The changes fix the calculation errors by:
1. **Converting shallow_depth semantics** from per-word to whole-sbitmap
   basis (see `__map_depth_with_shallow()` function)
2. **Updating all affected schedulers** (BFQ, Kyber, mq-deadline) to use
   the corrected calculation
3. **Simplifying sbq_calc_wake_batch()** to use the correct depth
   calculation

## Meets Stable Criteria

- **Fixes real bugs**: Addresses actual resource allocation and QoS
  violations
- **Obviously correct**: The math error is clear - using per-word depth
  when total depth is needed
- **Well-tested**: Patch has been reviewed and signed-off by subsystem
  maintainer (Jens Axboe)
- **Size reasonable**: While larger than typical, the changes are
  necessary to fix the bug across all affected schedulers
- **No new features**: Only fixes existing broken functionality

The bug causes incorrect behavior in production systems using I/O
scheduling with async depth limits or BFQ cgroup QoS, making this an
important fix for stable kernels.

 block/bfq-iosched.c     | 35 ++++++++++++--------------
 block/bfq-iosched.h     |  3 +--
 block/kyber-iosched.c   |  9 ++-----
 block/mq-deadline.c     | 16 +-----------
 include/linux/sbitmap.h |  6 ++---
 lib/sbitmap.c           | 56 +++++++++++++++++++++--------------------
 6 files changed, 52 insertions(+), 73 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..d68da9e92e1e 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -701,17 +701,13 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 {
 	struct bfq_data *bfqd = data->q->elevator->elevator_data;
 	struct bfq_io_cq *bic = bfq_bic_lookup(data->q);
-	int depth;
-	unsigned limit = data->q->nr_requests;
-	unsigned int act_idx;
+	unsigned int limit, act_idx;
 
 	/* Sync reads have full depth available */
-	if (op_is_sync(opf) && !op_is_write(opf)) {
-		depth = 0;
-	} else {
-		depth = bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
-		limit = (limit * depth) >> bfqd->full_depth_shift;
-	}
+	if (op_is_sync(opf) && !op_is_write(opf))
+		limit = data->q->nr_requests;
+	else
+		limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
 
 	for (act_idx = 0; bic && act_idx < bfqd->num_actuators; act_idx++) {
 		/* Fast path to check if bfqq is already allocated. */
@@ -725,14 +721,16 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 		 * available requests and thus starve other entities.
 		 */
 		if (bfqq_request_over_limit(bfqd, bic, opf, act_idx, limit)) {
-			depth = 1;
+			limit = 1;
 			break;
 		}
 	}
+
 	bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
-		__func__, bfqd->wr_busy_queues, op_is_sync(opf), depth);
-	if (depth)
-		data->shallow_depth = depth;
+		__func__, bfqd->wr_busy_queues, op_is_sync(opf), limit);
+
+	if (limit < data->q->nr_requests)
+		data->shallow_depth = limit;
 }
 
 static struct bfq_queue *
@@ -7128,9 +7126,8 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
  */
 static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
 {
-	unsigned int depth = 1U << bt->sb.shift;
+	unsigned int nr_requests = bfqd->queue->nr_requests;
 
-	bfqd->full_depth_shift = bt->sb.shift;
 	/*
 	 * In-word depths if no bfq_queue is being weight-raised:
 	 * leaving 25% of tags only for sync reads.
@@ -7142,13 +7139,13 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
 	 * limit 'something'.
 	 */
 	/* no more than 50% of tags for async I/O */
-	bfqd->word_depths[0][0] = max(depth >> 1, 1U);
+	bfqd->async_depths[0][0] = max(nr_requests >> 1, 1U);
 	/*
 	 * no more than 75% of tags for sync writes (25% extra tags
 	 * w.r.t. async I/O, to prevent async I/O from starving sync
 	 * writes)
 	 */
-	bfqd->word_depths[0][1] = max((depth * 3) >> 2, 1U);
+	bfqd->async_depths[0][1] = max((nr_requests * 3) >> 2, 1U);
 
 	/*
 	 * In-word depths in case some bfq_queue is being weight-
@@ -7158,9 +7155,9 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
 	 * shortage.
 	 */
 	/* no more than ~18% of tags for async I/O */
-	bfqd->word_depths[1][0] = max((depth * 3) >> 4, 1U);
+	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
 	/* no more than ~37% of tags for sync writes (~20% extra tags) */
-	bfqd->word_depths[1][1] = max((depth * 6) >> 4, 1U);
+	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
 }
 
 static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 687a3a7ba784..31217f196f4f 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -813,8 +813,7 @@ struct bfq_data {
 	 * Depth limits used in bfq_limit_depth (see comments on the
 	 * function)
 	 */
-	unsigned int word_depths[2][2];
-	unsigned int full_depth_shift;
+	unsigned int async_depths[2][2];
 
 	/*
 	 * Number of independent actuators. This is equal to 1 in
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..bfd9a40bb33d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -157,10 +157,7 @@ struct kyber_queue_data {
 	 */
 	struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
 
-	/*
-	 * Async request percentage, converted to per-word depth for
-	 * sbitmap_get_shallow().
-	 */
+	/* Number of allowed async requests. */
 	unsigned int async_depth;
 
 	struct kyber_cpu_latency __percpu *cpu_latency;
@@ -454,10 +451,8 @@ static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx)
 {
 	struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
 	struct blk_mq_tags *tags = hctx->sched_tags;
-	unsigned int shift = tags->bitmap_tags.sb.shift;
-
-	kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U;
 
+	kqd->async_depth = hctx->queue->nr_requests * KYBER_ASYNC_PERCENT / 100U;
 	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, kqd->async_depth);
 }
 
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..9ab6c6256695 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -487,20 +487,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-/*
- * 'depth' is a number in the range 1..INT_MAX representing a number of
- * requests. Scale it with a factor (1 << bt->sb.shift) / q->nr_requests since
- * 1..(1 << bt->sb.shift) is the range expected by sbitmap_get_shallow().
- * Values larger than q->nr_requests have the same effect as q->nr_requests.
- */
-static int dd_to_word_depth(struct blk_mq_hw_ctx *hctx, unsigned int qdepth)
-{
-	struct sbitmap_queue *bt = &hctx->sched_tags->bitmap_tags;
-	const unsigned int nrr = hctx->queue->nr_requests;
-
-	return ((qdepth << bt->sb.shift) + nrr - 1) / nrr;
-}
-
 /*
  * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
  * function is used by __blk_mq_get_tag().
@@ -517,7 +503,7 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 	 * Throttle asynchronous requests and writes such that these requests
 	 * do not block the allocation of synchronous requests.
 	 */
-	data->shallow_depth = dd_to_word_depth(data->hctx, dd->async_depth);
+	data->shallow_depth = dd->async_depth;
 }
 
 /* Called by blk_mq_update_nr_requests(). */
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 189140bf11fc..4adf4b364fcd 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -213,12 +213,12 @@ int sbitmap_get(struct sbitmap *sb);
  * sbitmap_get_shallow() - Try to allocate a free bit from a &struct sbitmap,
  * limiting the depth used from each word.
  * @sb: Bitmap to allocate from.
- * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * @shallow_depth: The maximum number of bits to allocate from the bitmap.
  *
  * This rather specific operation allows for having multiple users with
  * different allocation limits. E.g., there can be a high-priority class that
  * uses sbitmap_get() and a low-priority class that uses sbitmap_get_shallow()
- * with a @shallow_depth of (1 << (@sb->shift - 1)). Then, the low-priority
+ * with a @shallow_depth of (sb->depth >> 1). Then, the low-priority
  * class can only allocate half of the total bits in the bitmap, preventing it
  * from starving out the high-priority class.
  *
@@ -478,7 +478,7 @@ unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
  * sbitmap_queue, limiting the depth used from each word, with preemption
  * already disabled.
  * @sbq: Bitmap queue to allocate from.
- * @shallow_depth: The maximum number of bits to allocate from a single word.
+ * @shallow_depth: The maximum number of bits to allocate from the queue.
  * See sbitmap_get_shallow().
  *
  * If you call this, make sure to call sbitmap_queue_min_shallow_depth() after
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index d3412984170c..c07e3cd82e29 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -208,8 +208,28 @@ static int sbitmap_find_bit_in_word(struct sbitmap_word *map,
 	return nr;
 }
 
+static unsigned int __map_depth_with_shallow(const struct sbitmap *sb,
+					     int index,
+					     unsigned int shallow_depth)
+{
+	u64 shallow_word_depth;
+	unsigned int word_depth, reminder;
+
+	word_depth = __map_depth(sb, index);
+	if (shallow_depth >= sb->depth)
+		return word_depth;
+
+	shallow_word_depth = word_depth * shallow_depth;
+	reminder = do_div(shallow_word_depth, sb->depth);
+
+	if (reminder >= (index + 1) * word_depth)
+		shallow_word_depth++;
+
+	return (unsigned int)shallow_word_depth;
+}
+
 static int sbitmap_find_bit(struct sbitmap *sb,
-			    unsigned int depth,
+			    unsigned int shallow_depth,
 			    unsigned int index,
 			    unsigned int alloc_hint,
 			    bool wrap)
@@ -218,12 +238,12 @@ static int sbitmap_find_bit(struct sbitmap *sb,
 	int nr = -1;
 
 	for (i = 0; i < sb->map_nr; i++) {
-		nr = sbitmap_find_bit_in_word(&sb->map[index],
-					      min_t(unsigned int,
-						    __map_depth(sb, index),
-						    depth),
-					      alloc_hint, wrap);
+		unsigned int depth = __map_depth_with_shallow(sb, index,
+							      shallow_depth);
 
+		if (depth)
+			nr = sbitmap_find_bit_in_word(&sb->map[index], depth,
+						      alloc_hint, wrap);
 		if (nr != -1) {
 			nr += index << sb->shift;
 			break;
@@ -406,27 +426,9 @@ EXPORT_SYMBOL_GPL(sbitmap_bitmap_show);
 static unsigned int sbq_calc_wake_batch(struct sbitmap_queue *sbq,
 					unsigned int depth)
 {
-	unsigned int wake_batch;
-	unsigned int shallow_depth;
-
-	/*
-	 * Each full word of the bitmap has bits_per_word bits, and there might
-	 * be a partial word. There are depth / bits_per_word full words and
-	 * depth % bits_per_word bits left over. In bitwise arithmetic:
-	 *
-	 * bits_per_word = 1 << shift
-	 * depth / bits_per_word = depth >> shift
-	 * depth % bits_per_word = depth & ((1 << shift) - 1)
-	 *
-	 * Each word can be limited to sbq->min_shallow_depth bits.
-	 */
-	shallow_depth = min(1U << sbq->sb.shift, sbq->min_shallow_depth);
-	depth = ((depth >> sbq->sb.shift) * shallow_depth +
-		 min(depth & ((1U << sbq->sb.shift) - 1), shallow_depth));
-	wake_batch = clamp_t(unsigned int, depth / SBQ_WAIT_QUEUES, 1,
-			     SBQ_WAKE_BATCH);
-
-	return wake_batch;
+	return clamp_t(unsigned int,
+		       min(depth, sbq->min_shallow_depth) / SBQ_WAIT_QUEUES,
+		       1, SBQ_WAKE_BATCH);
 }
 
 int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
-- 
2.39.5
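
Reviewer aid, not part of the patch: to check the arithmetic of `__map_depth_with_shallow()` by hand, the following is a minimal userspace sketch under stated assumptions. `struct toy_sbitmap` and every `toy_*` name are hypothetical stand-ins invented for illustration, plain `/` and `%` replace the kernel's `do_div()`, and the pre-patch behaviour is approximated as "cap each word at shallow_depth", which is what the removed `min_t()` call did.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the struct sbitmap fields used below. */
struct toy_sbitmap {
	unsigned int depth;	/* total number of bits */
	unsigned int shift;	/* log2(bits per full word) */
	unsigned int map_nr;	/* number of words */
};

/* Same rule as __map_depth(): the last word may hold fewer bits. */
static unsigned int toy_map_depth(const struct toy_sbitmap *sb,
				  unsigned int index)
{
	if (index == sb->map_nr - 1)
		return sb->depth - (index << sb->shift);
	return 1U << sb->shift;
}

/* Pre-patch rule (approximation): shallow_depth caps every word separately. */
static unsigned int toy_old_word_limit(const struct toy_sbitmap *sb,
				       unsigned int index,
				       unsigned int shallow_depth)
{
	unsigned int word_depth = toy_map_depth(sb, index);

	return word_depth < shallow_depth ? word_depth : shallow_depth;
}

/*
 * Post-patch rule: scale the word's depth by shallow_depth / sb->depth and
 * round up in the lower-indexed words while the remainder allows it.
 */
static unsigned int toy_new_word_limit(const struct toy_sbitmap *sb,
				       unsigned int index,
				       unsigned int shallow_depth)
{
	unsigned int word_depth = toy_map_depth(sb, index);
	uint64_t product, quotient, remainder;

	if (shallow_depth >= sb->depth)
		return word_depth;

	/* 64-bit product here for simplicity; this is a sketch, not kernel code. */
	product = (uint64_t)word_depth * shallow_depth;
	quotient = product / sb->depth;
	remainder = product % sb->depth;

	if (remainder >= (uint64_t)(index + 1) * word_depth)
		quotient++;

	return (unsigned int)quotient;
}

typedef unsigned int (*toy_limit_fn)(const struct toy_sbitmap *,
				     unsigned int, unsigned int);

/* Sum the per-word limits: the most bits a shallow allocator could hold. */
static unsigned int toy_total(const struct toy_sbitmap *sb,
			      unsigned int shallow_depth, toy_limit_fn limit)
{
	unsigned int i, total = 0;

	assert(sb->map_nr > 0);
	for (i = 0; i < sb->map_nr; i++)
		total += limit(sb, i, shallow_depth);
	return total;
}
```

With depth = 256, shift = 6 and map_nr = 4, `toy_total(&sb, 1, toy_old_word_limit)` gives 4 (one tag per word, the BFQ case from the commit message), while `toy_total(&sb, 1, toy_new_word_limit)` gives 1. Under the new rule a shallow depth of 3 gives 3 and 128 gives 128 for the same bitmap.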