From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
My Android system, which sets up one memcg v2 cgroup per PID, suffers
from high block_rq_issue to block_rq_complete latency on loop devices,
which is actually introduced by the scheduling latency of the many
kworker threads involved. Further investigation showed that the EAS
scheduler makes this scenario worse, since it packs small-load tasks
onto one CPU core. This commit introduces an optional synchronous read
path for the loop device to help in this scenario. In a fio test, the
issue-to-complete (I2C) latency of the loop device's requests dropped
from 14ms to 2.1ms.
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
drivers/block/Kconfig | 10 ++++++++++
drivers/block/loop.c | 22 +++++++++++++++++++++-
2 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index df38fb364904..a30d6c5f466e 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -383,4 +383,14 @@ config BLK_DEV_ZONED_LOOP
 	  If unsure, say N.
+config LOOP_SYNC_READ
+	bool "Enable synchronous reads for the loop device"
+	default n
+	help
+	  Provide a synchronous read path for the loop device. This can
+	  help when the scheduling latency of the per-blkcg loop workers
+	  affects loop device requests, especially when many blkcgs are
+	  set up in the system. The loop device must be configured with
+	  LO_FLAGS_DIRECT_IO when this option is enabled.
+
 endif # BLK_DEV
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 053a086d547e..1e18abe48d2b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1884,7 +1884,27 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 #endif
 	}
 #endif
-	loop_queue_work(lo, cmd);
+#ifdef CONFIG_LOOP_SYNC_READ
+	if (req_op(rq) == REQ_OP_READ && cmd->use_aio && current->plug) {
+		struct blk_plug *plug = current->plug;
+
+		current->plug = NULL;
+		/* drain plug->mq_list and issue each read synchronously to the backing file */
+		while (rq) {
+			loff_t pos;
+
+			cmd = blk_mq_rq_to_pdu(rq);
+			pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
+			lo_rw_aio(lo, cmd, pos, ITER_DEST);
+			rq = rq_list_pop(&plug->mq_list);
+		}
+		plug->rq_count = 0;
+		current->plug = plug;
+	} else
+		loop_queue_work(lo, cmd);
+#else
+	loop_queue_work(lo, cmd);
+#endif
 	return BLK_STS_OK;
 }
--
2.25.1
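
[Editorial note, not part of the patch] The Kconfig help above requires
the loop device to be configured with LO_FLAGS_DIRECT_IO so that reads
go through lo_rw_aio(). A minimal userspace sketch of how such a device
could be set up with the existing LOOP_SET_FD and LOOP_SET_DIRECT_IO
ioctls; /dev/loop0 and the backing image path are placeholders:

/*
 * Sketch only: attach a backing file to a loop device and enable
 * direct I/O so the loop driver sets LO_FLAGS_DIRECT_IO and submits
 * requests through lo_rw_aio(). Paths are placeholders; error
 * handling is reduced to perror() for brevity.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/loop.h>
#include <unistd.h>

int main(void)
{
	int loop_fd = open("/dev/loop0", O_RDWR);
	int file_fd = open("/data/backing.img", O_RDWR);

	if (loop_fd < 0 || file_fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(loop_fd, LOOP_SET_FD, file_fd) < 0) {
		perror("LOOP_SET_FD");
		return 1;
	}
	/* ask the kernel to switch the backing file to direct I/O */
	if (ioctl(loop_fd, LOOP_SET_DIRECT_IO, 1UL) < 0) {
		perror("LOOP_SET_DIRECT_IO");
		return 1;
	}
	close(file_fd);
	close(loop_fd);
	return 0;
}

In practice the same setup is usually done with losetup --direct-io=on;
the point is only that without direct I/O the new read path above is
never taken, since cmd->use_aio is only set on direct-I/O loop devices.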
On Mon, Sep 22, 2025 at 11:29:15AM +0800, zhaoyang.huang wrote:
> This commit introduces an optional synchronous read path for the loop
> device to help in this scenario. In a fio test, the issue-to-complete
> (I2C) latency of the loop device's requests dropped from 14ms to
> 2.1ms.

So fix the scheduler, or create fewer helper threads, but this
workaround really looks like fixing the symptoms instead of even
trying to aim for the root cause.
On Tue, Sep 23, 2025 at 2:09 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> So fix the scheduler, or create fewer helper threads, but this
> workaround really looks like fixing the symptoms instead of even
> trying to aim for the root cause.

Yes, we have tried to solve this from those angles. As to the
scheduler, packing small tasks onto one core (a big core on ARM)
instead of spreading them is desired for power-saving reasons. As to
the number of kworker threads, the current loop design creates a new
worker for each blkcg, and Android's current approach gives each PID
its own cgroup, and hence its own kworker thread, which is what
induces this scenario.
On 9/22/25 8:50 PM, Zhaoyang Huang wrote:
> Yes, we have tried to solve this from those angles. As to the
> scheduler, packing small tasks onto one core (a big core on ARM)
> instead of spreading them is desired for power-saving reasons. As to
> the number of kworker threads, the current loop design creates a new
> worker for each blkcg, and Android's current approach gives each PID
> its own cgroup, and hence its own kworker thread, which is what
> induces this scenario.

More cgroups means more overhead from cgroup-internal tasks, e.g.
accumulating statistics. How about asking the Android core team to
review the approach of associating one cgroup with each PID? I'm
wondering whether the approach of one cgroup per aggregate profile
(SCHED_SP_BACKGROUND, SCHED_SP_FOREGROUND, ...) would work.

Thanks,

Bart.
Looping in the Google kernel team. When the active_depth of cgroup v2
is set to 3, the loop device's issue-to-complete (I2C) latency is
affected by the scheduling latency introduced by the huge number of
kworker threads, one per blkcg. What's your opinion on this RFC patch?

On Wed, Sep 24, 2025 at 12:30 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> More cgroups means more overhead from cgroup-internal tasks, e.g.
> accumulating statistics. How about asking the Android core team to
> review the approach of associating one cgroup with each PID? I'm
> wondering whether the approach of one cgroup per aggregate profile
> (SCHED_SP_BACKGROUND, SCHED_SP_FOREGROUND, ...) would work.
>
> Thanks,
>
> Bart.
On Wed, Sep 24, 2025 at 5:13 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> Looping in the Google kernel team. When the active_depth of cgroup v2
> is set to 3, the loop device's issue-to-complete (I2C) latency is
> affected by the scheduling latency introduced by the huge number of
> kworker threads, one per blkcg. What's your opinion on this RFC patch?

There are some issues with this RFC patch:

- current->plug can't be touched by the driver, because the plug can
  hold requests from other devices
- you can't sleep in loop_queue_rq()

The following patchset should address your issue, and I can rebase &
resend if no one objects.

https://lore.kernel.org/linux-block/20250322012617.354222-1-ming.lei@redhat.com/

Thanks,
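
[Editorial note] As background for the second point: blk-mq only allows
->queue_rq() to sleep when the driver advertises that with
BLK_MQ_F_BLOCKING on its tag set, and loop's tag set does not set that
flag (consistent with the remark above), so calling lo_rw_aio()
synchronously from loop_queue_rq() is not allowed. The sketch below is
purely illustrative, using made-up example_* names; it shows the
general blk-mq contract, not how loop or the linked patchset work:

/*
 * Illustrative sketch only (hypothetical example_* driver, not the
 * loop driver): a blk-mq driver that wants to be allowed to sleep in
 * its ->queue_rq() callback must set BLK_MQ_F_BLOCKING on its tag
 * set. Without that flag, ->queue_rq() may be invoked from a context
 * where sleeping is a bug.
 */
#include <linux/blk-mq.h>
#include <linux/module.h>
#include <linux/numa.h>

struct example_cmd {
	int dummy;			/* hypothetical per-request data */
};

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	/*
	 * Because the tag set carries BLK_MQ_F_BLOCKING, this callback
	 * runs in a context that may sleep, so synchronous submission
	 * to a backing file would be legal here.
	 */
	blk_mq_start_request(bd->rq);
	blk_mq_end_request(bd->rq, BLK_STS_OK);	/* completes immediately in this sketch */
	return BLK_STS_OK;
}

static const struct blk_mq_ops example_mq_ops = {
	.queue_rq	= example_queue_rq,
};

static struct blk_mq_tag_set example_tag_set = {
	.ops		= &example_mq_ops,
	.nr_hw_queues	= 1,
	.queue_depth	= 128,
	.numa_node	= NUMA_NO_NODE,
	.cmd_size	= sizeof(struct example_cmd),
	.flags		= BLK_MQ_F_BLOCKING,	/* allow ->queue_rq() to sleep */
};

static int __init example_init(void)
{
	return blk_mq_alloc_tag_set(&example_tag_set);
}

static void __exit example_exit(void)
{
	blk_mq_free_tag_set(&example_tag_set);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");

This only illustrates the constraint behind "you can't sleep in
loop_queue_rq()"; it does not describe what the linked patchset does.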