[PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator

Yu Kuai posted 5 patches 3 months, 3 weeks ago
There is a newer version of this series
[PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
Posted by Yu Kuai 3 months, 3 weeks ago
From: Yu Kuai <yukuai3@huawei.com>

Before this patch set, each dispatch context holds a global lock to
dispatch one request at a time, which introduces intense lock contention:

lock
ops.dispatch_request
unlock

Hence, support dispatching a batch of requests while holding the lock to
reduce lock contention.
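
To make the idea concrete, here is a minimal user-space sketch (C with a
pthread mutex) of the difference. It is not the kernel code in this series,
and all names (dispatch_one, dispatch_batch, BATCH) are made up for
illustration: the old path takes the lock once per dispatched request, the
new path pulls a small batch of requests during a single lock hold.

/*
 * batch_dispatch_sketch.c - illustrative user-space sketch only, NOT the
 * kernel code in this series; all names are made up for the example.
 * Build with: gcc -O2 -pthread batch_dispatch_sketch.c
 */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define BATCH	8

struct rq {
	int tag;
	struct rq *next;
};

static pthread_mutex_t elevator_lock = PTHREAD_MUTEX_INITIALIZER;
static struct rq *fifo;	/* pending requests, in elevator order */

/* before: one lock/unlock round trip per dispatched request */
static struct rq *dispatch_one(void)
{
	struct rq *rq;

	pthread_mutex_lock(&elevator_lock);
	rq = fifo;
	if (rq)
		fifo = rq->next;
	pthread_mutex_unlock(&elevator_lock);
	return rq;
}

/* after: pull up to BATCH requests while the lock is held only once */
static size_t dispatch_batch(struct rq **out)
{
	size_t n = 0;

	pthread_mutex_lock(&elevator_lock);
	while (n < BATCH && fifo) {
		out[n++] = fifo;
		fifo = fifo->next;
	}
	pthread_mutex_unlock(&elevator_lock);
	return n;
}

int main(void)
{
	struct rq rqs[4] = {
		{ 1, &rqs[1] }, { 2, &rqs[2] }, { 3, &rqs[3] }, { 4, NULL },
	};
	struct rq *batch[BATCH];
	size_t i, n;

	fifo = &rqs[0];
	printf("single dispatch: rq%d\n", dispatch_one()->tag);
	n = dispatch_batch(batch);
	for (i = 0; i < n; i++)
		printf("batched dispatch: rq%d\n", batch[i]->tag);
	return 0;
}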

nullblk setup:
modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 512 > blocksize &&
    echo 0 > home_node &&
    echo 0 > irqmode &&
    echo 128 > submit_queues &&
    echo 1024 > hw_queue_depth &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 2 > queue_mode &&
    echo 1 > power ||
    exit $?

Test script:
fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
  -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
  -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30

Test result (elevator is deadline), IOPS:
|                 | null_blk | scsi hdd |
| --------------- | -------- | -------- |
| before this set | 263k     | 24       |
| after this set  | 475k     | 272      |

Yu Kuai (5):
  elevator: introduce global lock for sq_shared elevator
  mq-deadline: switch to use elevator lock
  block, bfq: switch to use elevator lock
  blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  blk-mq-sched: support request batch dispatching for sq elevator

 block/bfq-cgroup.c   |   4 +-
 block/bfq-iosched.c  |  53 ++++------
 block/bfq-iosched.h  |   2 +-
 block/blk-mq-sched.c | 226 +++++++++++++++++++++++++++++--------------
 block/blk-mq.c       |   5 +-
 block/blk-mq.h       |  21 ++++
 block/elevator.c     |   1 +
 block/elevator.h     |  61 +++++++++++-
 block/mq-deadline.c  |  58 +++++------
 9 files changed, 282 insertions(+), 149 deletions(-)

-- 
2.39.2
Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
Posted by Damien Le Moal 3 months, 3 weeks ago
On 6/14/25 18:25, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Before this patch set, each dispatch context holds a global lock to
> dispatch one request at a time, which introduces intense lock contention:
> 
> lock
> ops.dispatch_request
> unlock
> 
> Hence, support dispatching a batch of requests while holding the lock to
> reduce lock contention.
> 
> nullblk setup:
> modprobe null_blk nr_devices=0 &&
>     udevadm settle &&
>     cd /sys/kernel/config/nullb &&
>     mkdir nullb0 &&
>     cd nullb0 &&
>     echo 0 > completion_nsec &&
>     echo 512 > blocksize &&
>     echo 0 > home_node &&
>     echo 0 > irqmode &&
>     echo 128 > submit_queues &&
>     echo 1024 > hw_queue_depth &&
>     echo 1024 > size &&
>     echo 0 > memory_backed &&
>     echo 2 > queue_mode &&
>     echo 1 > power ||
>     exit $?
> 
> Test script:
> fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
>   -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
>   -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30
> 
> Test result (elevator is deadline), IOPS:
> |                 | null_blk | scsi hdd |
> | --------------- | -------- | -------- |
> | before this set | 263k     | 24       |
> | after this set  | 475k     | 272      |

For the HDD, these numbers are very low, and I do not understand how you can get
any improvement from reducing lock contention, since contention should not be an
issue with this kind of performance. What HW did you use for testing ? Was this
a VM ?

I tested this null_blk setup and your fio command on a bare-metal 16-core Xeon
machine. For the SCSI disk, I used a 26TB SATA HDD connected to an AHCI port.
With this setup, results are like this:

|                 | null_blk | hdd (write) | hdd (read) |
| --------------- | -------- | ----------- | ---------- |
| before this set | 613k     | 1088        | 211        |
| after this set  | 940k     | 1093        | 212        |

So not surprisingly, there is no improvement for the SATA HDD because of the low
max IOPS these devices can achieve: lock contention is not really an issue when
you are dealing with a slow device. And a SAS HDD will be the same. Gains may
likely be more significant with a fast SAS/FC RAID array but I do not have
access to that.

But the improvement for a fast device like null_blk is indeed excellent (+53%).

With LOCKDEP & KASAN disabled, the results are like this:

|                 | null_blk | hdd (write) | hdd (read) |
| --------------- | -------- | ----------- | ---------- |
| before this set | 625k     | 1092        | 213        |
| after this set  | 984k     | 1095        | 215        |

No real changes for the HDD, as expected, and the improvement for null_blk is
still good.

So maybe drop the RFC tag on these patches and repost after cleaning things up ?


-- 
Damien Le Moal
Western Digital Research
Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
Posted by Yu Kuai 3 months, 3 weeks ago
Hi,

On 2025/06/16 12:03, Damien Le Moal wrote:
> On 6/14/25 18:25, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Before this patch set, each dispatch context holds a global lock to
>> dispatch one request at a time, which introduces intense lock contention:
>>
>> lock
>> ops.dispatch_request
>> unlock
>>
>> Hence, support dispatching a batch of requests while holding the lock to
>> reduce lock contention.
>>
>> nullblk setup:
>> modprobe null_blk nr_devices=0 &&
>>      udevadm settle &&
>>      cd /sys/kernel/config/nullb &&
>>      mkdir nullb0 &&
>>      cd nullb0 &&
>>      echo 0 > completion_nsec &&
>>      echo 512 > blocksize &&
>>      echo 0 > home_node &&
>>      echo 0 > irqmode &&
>>      echo 128 > submit_queues &&
>>      echo 1024 > hw_queue_depth &&
>>      echo 1024 > size &&
>>      echo 0 > memory_backed &&
>>      echo 2 > queue_mode &&
>>      echo 1 > power ||
>>      exit $?
>>
>> Test script:
>> fio -filename=/dev/$disk -name=test -rw=randwrite -bs=4k -iodepth=32 \
>>    -numjobs=16 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
>>    -direct=1 -ioengine=io_uring -group_reporting -time_based -runtime=30
>>
>> Test result (elevator is deadline), IOPS:
>> |                 | null_blk | scsi hdd |
>> | --------------- | -------- | -------- |
>> | before this set | 263k     | 24       |
>> | after this set  | 475k     | 272      |
> 
> For the HDD, these numbers are very low, and I do not understand how you can get
> any improvement from reducing lock contention, since contention should not be an
> issue with this kind of performance. What HW did you use for testing ? Was this
> a VM ?
> 

Thanks for reviewing this RFC set! I'm curious why there is an improvement
as well; I didn't have the answer when I sent this set.

I'm testing on a 256-core Kunpeng-920 server, with an MG04ACA600E 5TB HDD
attached to hisi_sas_v3, and the disk has been used for testing for more
than 5 years; perhaps this is why the randwrite numbers are so low.

> I tested this null_blk setup and your fio command on a bare-metal 16-core Xeon
> machine. For the SCSI disk, I used a 26TB SATA HDD connected to an AHCI port.
> With this setup, results are like this:
> 
> |                 | null_blk | hdd (write) | hdd (read) |
> | --------------- | -------- | ----------- | ---------- |
> | before this set | 613k     | 1088        | 211        |
> | after this set  | 940k     | 1093        | 212        |
> 
> So not surprisingly, there is no improvement for the SATA HDD because of the low
> max IOPS these devices can achieve: lock contention is not really an issue when
> you are dealing with a slow device. And a SAS HDD will be the same. Gains may
> likely be more significant with a fast SAS/FC RAID array but I do not have
> access to that.
> 
> But the improvement for a fast device like null_blk is indeed excellent (+53%).
> 
> With LOCKDEP & KASAN disabled, the results are like this:
> 
> |                 | null_blk | hdd (write) | hdd (read) |
> | --------------- | -------- | ----------- | ---------- |
> | before this set | 625k     | 1092        | 213        |
> | after this set  | 984k     | 1095        | 215        |
> 
> No real changes for the HDD, as expected, and the improvement for null_blk is
> still good.

I agree that lock contention here will not affect HDD performance.
What I suspect causes the difference in my environment is that the order of
requests might change between the elevator dispatching them and the disk
handling them.

For example, the order can easily be changed if more than one context
dispatches one request at a time:

t1:

lock
rq1 = dd_dispatch_request
unlock
			t2:
			lock
			rq2 = dd_dispatch_request
			unlock

lock
rq3 = dd_dispatch_request
unlock

			lock
			rq4 = dd_dispatch_request
			unlock

//rq1,rq3 issue to disk
			// rq2, rq4 issue to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the
order seen by the disk is rq 1-3-2-4.

And with batch request dispatching, is this less likely to happen?
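
My guess (only intuition for now, not verified) is that batch dispatching
makes this less likely, because each context pulls several requests during a
single lock hold and issues them in elevator order, so interleaving between
contexts happens at batch granularity, e.g.:

t1:

lock
rq1, rq2 = dispatch a batch of requests
unlock
			t2:
			lock
			rq3, rq4 = dispatch a batch of requests
			unlock

//rq1,rq2 issue to disk
			// rq3, rq4 issue to disk

Here the order seen by the disk stays rq 1-2-3-4.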
> 
> So maybe drop the RFC tag on these patches and repost after cleaning things up ?

Sure, thanks again for reviewing this RFC set.
Kuai

> 
> 

Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
Posted by Damien Le Moal 3 months, 3 weeks ago
On 6/16/25 16:22, Yu Kuai wrote:
> I agree that lock contention here will not affect HDD performance.
> What I suspect causes the difference in my environment is that the order of
> requests might change between the elevator dispatching them and the disk
> handling them.
> 
> For example, the order can easily be changed if more than one context
> dispatches one request at a time:
> 
> t1:
> 
> lock
> rq1 = dd_dispatch_request
> unlock
> 			t2:
> 			lock
> 			rq2 = dd_dispatch_request
> 			unlock
> 
> lock
> rq3 = dd_dispatch_request
> unlock
> 
> 			lock
> 			rq4 = dd_dispatch_request
> 			unlock
> 
> //rq1,rq3 issue to disk
> 			// rq2, rq4 issue to disk
> 
> In this case, the elevator dispatch order is rq 1-2-3-4; however, the
> order seen by the disk is rq 1-3-2-4.
> 
> And with batch request dispatching, is this less likely to happen?

If you are running a write test with the HDD write cache enabled, such
reordering will most likely not matter at all. Running the same workload with
the "none" scheduler, I get the same IOPS for writes.

Check your disk. If you do have the HDD write cache disabled, then sure, the
order will matter more depending on how your drive handles WCD writes (recent
drives have very similar performance with WCE and WCD).

-- 
Damien Le Moal
Western Digital Research
Re: [PATCH RFC v2 0/5] blk-mq-sched: support request batch dispatching for sq elevator
Posted by Yu Kuai 3 months, 3 weeks ago
Hi,

On 2025/06/16 15:37, Damien Le Moal wrote:
> On 6/16/25 16:22, Yu Kuai wrote:
>> I agree that lock contention here will not affect HDD performance.
>> What I suspect causes the difference in my environment is that the order of
>> requests might change between the elevator dispatching them and the disk
>> handling them.
>>
>> For example, the order can easily be changed if more than one context
>> dispatches one request at a time:
>>
>> t1:
>>
>> lock
>> rq1 = dd_dispatch_request
>> unlock
>> 			t2:
>> 			lock
>> 			rq2 = dd_dispatch_request
>> 			unlock
>>
>> lock
>> rq3 = dd_dispatch_request
>> unlock
>>
>> 			lock
>> 			rq4 = dd_dispatch_request
>> 			unlock
>>
>> //rq1,rq3 issue to disk
>> 			// rq2, rq4 issue to disk
>>
>> In this case, the elevator dispatch order is rq 1-2-3-4; however, the
>> order seen by the disk is rq 1-3-2-4.
>>
>> And with batch request dispatching, is this less likely to happen?
> 
> If you are running a write test with the HDD write cache enabled, such
> reordering will most likely not matter at all. Running the same workload with
> the "none" scheduler, I get the same IOPS for writes.
> 
> Check your disk. If you do have the HDD write cache disabled, then sure, the
> order will matter more depending on how your drive handles WCD writes (recent
> drives have very similar performance with WCE and WCD).
> 
Thanks for the explanation. I'll test more workloads on more disks and, of
course, explain the details more in the formal version as you suggested.

Kuai