From: Yu Kuai <yukuai3@huawei.com>

Changes from v1:
 - the ioc changes are sent separately;
 - change the patch 1-3 order as suggested by Damien;

Currently, both mq-deadline and bfq have a global spin lock that is
grabbed inside elevator methods like dispatch_request, insert_requests,
and bio_merge. This global lock is the main reason mq-deadline and bfq
can't scale very well.

For the dispatch_request method, the current behavior is dispatching one
request at a time. With multiple dispatching contexts, this behavior, on
the one hand, introduces intense lock contention:

t1:            t2:            t3:
lock           lock           lock
// grab lock
ops.dispatch_request
unlock
               // grab lock
               ops.dispatch_request
               unlock
                              // grab lock
                              ops.dispatch_request
                              unlock

On the other hand, it messes up the request dispatch order:

t1:                           t2:
lock
rq1 = ops.dispatch_request
unlock
                              lock
                              rq2 = ops.dispatch_request
                              unlock
lock
rq3 = ops.dispatch_request
unlock
                              lock
                              rq4 = ops.dispatch_request
                              unlock
// rq1, rq3 issued to disk
                              // rq2, rq4 issued to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the
order seen by the disk is rq 1-3-2-4: the order of rq2 and rq3 is
inverted.

While dispatching a request, blk_mq_get_dispatch_budget() and
blk_mq_get_driver_tag() must be called, and they are not ready to be
called inside elevator methods, hence introducing a new method like
dispatch_requests is not possible.

In conclusion, this set factors the global lock out of the
dispatch_request method, and supports request batch dispatch by calling
the method multiple times while holding the lock.

nullblk setup:
modprobe null_blk nr_devices=0 &&
udevadm settle &&
cd /sys/kernel/config/nullb &&
mkdir nullb0 &&
cd nullb0 &&
echo 0 > completion_nsec &&
echo 512 > blocksize &&
echo 0 > home_node &&
echo 0 > irqmode &&
echo 128 > submit_queues &&
echo 1024 > hw_queue_depth &&
echo 1024 > size &&
echo 0 > memory_backed &&
echo 2 > queue_mode &&
echo 1 > power ||
exit $?
Test script:
fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
  -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
  -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30

Test result: iops

| | deadline | bfq |
| --------------- | -------- | -------- |
| before this set | 263k | 124k |
| after this set | 475k | 292k |

Yu Kuai (5):
  blk-mq-sched: introduce high level elevator lock
  mq-deadline: switch to use elevator lock
  block, bfq: switch to use elevator lock
  blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  blk-mq-sched: support request batch dispatching for sq elevator

 block/bfq-cgroup.c   |   4 +-
 block/bfq-iosched.c  |  49 +++++----
 block/bfq-iosched.h  |   2 +-
 block/blk-mq-sched.c | 241 ++++++++++++++++++++++++++++++-------------
 block/blk-mq.h       |  21 ++++
 block/elevator.c     |   1 +
 block/elevator.h     |   4 +-
 block/mq-deadline.c  |  58 +++++------
 8 files changed, 248 insertions(+), 132 deletions(-)

-- 
2.39.2
On Wed, Jul 30, 2025 at 04:22:02PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> [...]
>
> In conclusion, this set factors the global lock out of the
> dispatch_request method, and supports request batch dispatch by calling
> the method multiple times while holding the lock.
> [...]
>
> Test result: iops
>
> | | deadline | bfq |
> | --------------- | -------- | -------- |
> | before this set | 263k | 124k |
> | after this set | 475k | 292k |

Batch dispatch may hurt IO merge performance, which is important for an
elevator, so please provide test data on real HDD and SSD, instead of
null_blk only. It would be perfect if a merge-sensitive workload is
evaluated.

Thanks,
Ming
Hi,

On 2025/07/31 16:18, Ming Lei wrote:
> Batch dispatch may hurt IO merge performance, which is important for an
> elevator, so please provide test data on real HDD and SSD, instead of
> null_blk only. It would be perfect if a merge-sensitive workload is
> evaluated.

OK, I'll provide test data on the HDD and SSD that I have for now.

For the elevator IO merge case, what I have in mind is that we issue
small sequential IO one by one from multiple contexts, so that bios
won't be merged in the plug; this requires IO to be issued faster than
it completes. Is this case enough?

Thanks,
Kuai
On Thu, Jul 31, 2025 at 04:42:10PM +0800, Yu Kuai wrote:
> [...]
>
> For the elevator IO merge case, what I have in mind is that we issue
> small sequential IO one by one from multiple contexts, so that bios
> won't be merged in the plug; this requires IO to be issued faster than
> it completes. Is this case enough?

A long time ago, I investigated one such issue triggered by a qemu
workload, but I'm not sure I can find it now.

Also, many SCSI devices may easily run into a busy queue; then scheduler
merging starts to work, and it may perform worse if you dispatch more in
this situation.

Thanks,
Ming
Hi,

On 2025/07/31 17:25, Ming Lei wrote:
> [...]
>
> Also, many SCSI devices may easily run into a busy queue; then scheduler
> merging starts to work, and it may perform worse if you dispatch more in
> this situation.

I think we won't dispatch more in this case: on the one hand, we get
budgets first, to make sure we never dispatch more than queue_depth; on
the other hand, when hctx->dispatch_busy is set, we still fall back to
the old behavior of dispatching one request at a time.

So if IO merging starts because many rqs are accumulated inside the
elevator and the driver is full, I expect IO merging to behave the same
with this set. I'll run a test to check.

Thanks,
Kuai
On Thu, Jul 31, 2025 at 05:33:24PM +0800, Yu Kuai wrote:
> [...]
>
> I think we won't dispatch more in this case: on the one hand, we get
> budgets first, to make sure we never dispatch more than queue_depth;

OK.

> on the other hand, when hctx->dispatch_busy is set, we still fall back
> to the old behavior of dispatching one request at a time.

hctx->dispatch_busy is lockless; all requests may get dispatched before
hctx->dispatch_busy is set.

Thanks,
Ming