From: Yu Kuai <yukuai3@huawei.com>

Changes from v1:
 - the ioc changes are sent separately;
 - change the patch 1-3 order as suggested by Damien;

Currently, both mq-deadline and bfq have a global spin lock that is
grabbed inside elevator methods like dispatch_request, insert_requests,
and bio_merge. This global lock is the main reason mq-deadline and bfq
can't scale very well.

For the dispatch_request method, the current behavior is dispatching one
request at a time. With multiple dispatching contexts, this behavior, on
the one hand, introduces intense lock contention:

t1:            t2:            t3:
lock           lock           lock
// grab lock
ops.dispatch_request
unlock
               // grab lock
               ops.dispatch_request
               unlock
                              // grab lock
                              ops.dispatch_request
                              unlock

On the other hand, it messes up the request dispatch order:

t1:                           t2:
lock
rq1 = ops.dispatch_request
unlock
                              lock
                              rq2 = ops.dispatch_request
                              unlock
lock
rq3 = ops.dispatch_request
unlock
                              lock
                              rq4 = ops.dispatch_request
                              unlock
// rq1, rq3 issued to disk
                              // rq2, rq4 issued to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the
order seen by the disk is rq 1-3-2-4: the order of rq2 and rq3 is
inverted.

While dispatching a request, blk_mq_get_dispatch_budget() and
blk_mq_get_driver_tag() must be called, and they are not ready to be
called inside elevator methods, hence introducing a new method like
dispatch_requests is not possible.

In conclusion, this set factors the global lock out of the
dispatch_request method, and supports request batch dispatch by calling
the method multiple times while holding the lock.

nullblk setup:
modprobe null_blk nr_devices=0 &&
udevadm settle &&
cd /sys/kernel/config/nullb &&
mkdir nullb0 &&
cd nullb0 &&
echo 0 > completion_nsec &&
echo 512 > blocksize &&
echo 0 > home_node &&
echo 0 > irqmode &&
echo 128 > submit_queues &&
echo 1024 > hw_queue_depth &&
echo 1024 > size &&
echo 0 > memory_backed &&
echo 2 > queue_mode &&
echo 1 > power ||
exit $?
Test script:
fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
  -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
  -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30

Test result: iops

| | deadline | bfq |
| --------------- | -------- | -------- |
| before this set | 263k | 124k |
| after this set | 475k | 292k |

Yu Kuai (5):
  blk-mq-sched: introduce high level elevator lock
  mq-deadline: switch to use elevator lock
  block, bfq: switch to use elevator lock
  blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  blk-mq-sched: support request batch dispatching for sq elevator

 block/bfq-cgroup.c   |   4 +-
 block/bfq-iosched.c  |  49 +++++----
 block/bfq-iosched.h  |   2 +-
 block/blk-mq-sched.c | 241 ++++++++++++++++++++++++++++++-------------
 block/blk-mq.h       |  21 ++++
 block/elevator.c     |   1 +
 block/elevator.h     |   4 +-
 block/mq-deadline.c  |  58 +++++------
 8 files changed, 248 insertions(+), 132 deletions(-)

-- 
2.39.2
On Wed, Jul 30, 2025 at 04:22:02PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> [...]
>
> In conclusion, this set factors the global lock out of the
> dispatch_request method, and supports request batch dispatch by calling
> the method multiple times while holding the lock.
> [...]
>
> Test result: iops
>
> | | deadline | bfq |
> | --------------- | -------- | -------- |
> | before this set | 263k | 124k |
> | after this set | 475k | 292k |

Batch dispatch may hurt IO merge performance, which is important for an
elevator, so please provide test data on real HDD and SSD, instead of
null_blk only. It would be perfect if a merge-sensitive workload is
evaluated.

Thanks,
Ming
Hi,

On 2025/07/31 16:18, Ming Lei wrote:
> Batch dispatch may hurt IO merge performance, which is important for an
> elevator, so please provide test data on real HDD and SSD, instead of
> null_blk only. It would be perfect if a merge-sensitive workload is
> evaluated.

OK, I'll provide test data on the HDD and SSD that I have for now.

For the elevator IO merge case, what I have in mind is that we issue
small sequential IO one by one from multiple contexts, so that bios
won't be merged in the plug; this requires IO to be issued faster than
it completes. Is this case enough?

Thanks,
Kuai
On Thu, Jul 31, 2025 at 04:42:10PM +0800, Yu Kuai wrote:
> [...]
>
> For the elevator IO merge case, what I have in mind is that we issue
> small sequential IO one by one from multiple contexts, so that bios
> won't be merged in the plug; this requires IO to be issued faster than
> it completes. Is this case enough?

A long time ago, I investigated one such issue triggered by a qemu
workload, but I'm not sure I can find it now.

Also, many SCSI devices may easily run into a busy queue; then scheduler
merging starts to work, and it may perform worse if you dispatch more in
this situation.

Thanks,
Ming
Hi,

On 2025/07/31 17:25, Ming Lei wrote:
> [...]
>
> Also, many SCSI devices may easily run into a busy queue; then scheduler
> merging starts to work, and it may perform worse if you dispatch more in
> this situation.

I think we won't dispatch more in this case: on the one hand, we get
budgets first, to make sure we never dispatch more than queue_depth; on
the other hand, when hctx->dispatch_busy is set, we still fall back to
the old behavior of dispatching one request at a time.

So if IO merging starts because many rqs are accumulated inside the
elevator and the driver is full, I expect IO merging to behave the same
with this set. I'll run a test to check.

Thanks,
Kuai
On Thu, Jul 31, 2025 at 05:33:24PM +0800, Yu Kuai wrote:
> [...]
>
> I think we won't dispatch more in this case: on the one hand, we get
> budgets first, to make sure we never dispatch more than queue_depth;

OK.

> on the other hand, when hctx->dispatch_busy is set, we still fall back
> to the old behavior of dispatching one request at a time.

hctx->dispatch_busy is lockless; all requests may get dispatched before
hctx->dispatch_busy is set.

Thanks,
Ming