[v2] block/mq-deadline: adjust the timeout period of the per_prio->dispatch

[PATCH v2] block/mq-deadline: adjust the timeout period of the per_prio->dispatch

Posted by chengkaitao 4 months ago

From: Chengkaitao <chengkaitao@kylinos.cn>

When adding a request to the sort_list/fifo_list, the kernel assigns
{jiffies + dd->fifo_expire[data_dir]} to the fifo_time member.
Consequently, we must subtract {dd->fifo_expire[rq_data_dir(rq)]} in
the started_after function.

In contrast, Commit (725f22a1477c) introduced changes to utilize the
fifo_timevmember of requests on a dispatch list. The patch assigns
{jiffies} to the fifo_time member when adding a request to a dispatch
list. However, it continues to use the started_after function, which
still subtracts {dd->fifo_expire[rq_data_dir(rq)]} from the start_time.
The commit message does not explain this design choice, though it appears
reasonably justifiable since dispatch lists should inherently have higher
priority than sort_lists/fifo_lists. Thus, the default fifo_time set for
dispatch lists should be smaller than that set for sort_lists/fifo_lists.

Originally, {dd->fifo_expire[data_dir]} was exclusively used in the
deadline_check_fifo function to control the timing of scanning fifo_lists.
This subsequently determines the offset for the next scan of sort_lists
via {dd->per_prio[prio].latest_pos[data_dir]}. Therefore, a larger
{dd->fifo_expire[data_dir]} value makes it less likely for timed-out
requests to be scanned.

However, Commit (725f22a1477c) reversed the semantic meaning of
{dd->fifo_expire[data_dir]}. When adding a request to a dispatch list,
the patch only assigns {jiffies} to the fifo_time member without
compensating for the {start_time -= dd->fifo_expire[rq_data_dir(rq)]}
operation in the started_after function. This results in larger
{dd->fifo_expire[data_dir]} values making timed-out requests more likely
to be scanned. By default, fifo_expire[DD_WRITE] > fifo_expire[DD_READ],
which incorrectly gives write-IO-requests higher priority than
read-IO-requests and creates inherently illogical prioritization.

On the other hand, the Commit (725f22a1477c) merges the effects of
fifo_expire and prio_aging_expire on the same code behavior, creating
redundant interactions. To address this, our new patch introduces
numerical compensation for {dd->fifo_expire[data_dir]} when adding
requests to dispatch lists. To maintain original logic as much as
possible while enhancing dispatch list priority, we additionally
subtract {dd->prio_aging_expire / 2} from the fifo_time, with default
values, {dd->prio_aging_expire / 2} equals {dd->fifo_expire[DD_WRITE]}.

Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn>
---
v2: Add a more detailed commit message

Link to V1:
https://lore.kernel.org/all/20250926023818.16223-1-pilgrimtao@gmail.com/

 block/mq-deadline.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3e741d33142d..fedc66187150 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -659,7 +659,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	if (flags & BLK_MQ_INSERT_AT_HEAD) {
 		list_add(&rq->queuelist, &per_prio->dispatch);
-		rq->fifo_time = jiffies;
+		rq->fifo_time = jiffies + dd->fifo_expire[data_dir]
+				- dd->prio_aging_expire / 2;
 	} else {
 		deadline_add_rq_rb(per_prio, rq);
 
-- 
2.50.1 (Apple Git-155)

Re: [PATCH v2] block/mq-deadline: adjust the timeout period of the per_prio->dispatch

Posted by Bart Van Assche 4 months ago

On 10/9/25 8:52 AM, chengkaitao wrote:
> On the other hand, the Commit (725f22a1477c) merges the effects of
> fifo_expire and prio_aging_expire on the same code behavior, creating
> redundant interactions. To address this, our new patch introduces
> numerical compensation for {dd->fifo_expire[data_dir]} when adding
> requests to dispatch lists. To maintain original logic as much as
> possible while enhancing dispatch list priority, we additionally
> subtract {dd->prio_aging_expire / 2} from the fifo_time, with default
> values, {dd->prio_aging_expire / 2} equals {dd->fifo_expire[DD_WRITE]}.

No assumptions should be made about the relative values of
dd->prio_aging_expire and dd->fifo_expire[DD_WRITE] since these values
can be modified via sysfs.

> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 3e741d33142d..fedc66187150 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -659,7 +659,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   
>   	if (flags & BLK_MQ_INSERT_AT_HEAD) {
>   		list_add(&rq->queuelist, &per_prio->dispatch);
> -		rq->fifo_time = jiffies;
> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir]
> +				- dd->prio_aging_expire / 2;
>   	} else {
>   		deadline_add_rq_rb(per_prio, rq);

Thanks for having added a detailed patch description. Please remove
"/ 2" from the above patch to make sure that BLK_MQ_INSERT_AT_HEAD
requests are submitted to the block driver before other requests. This
is important if a request is requeued. Additionally, a comment should be
added above the modified line of code that explains the purpose of the
calculation. How about this untested patch?

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3e741d33142d..566646591ddd 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -659,7 +659,14 @@ static void dd_insert_request(struct blk_mq_hw_ctx 
*hctx, struct request *rq,

  	if (flags & BLK_MQ_INSERT_AT_HEAD) {
  		list_add(&rq->queuelist, &per_prio->dispatch);
-		rq->fifo_time = jiffies;
+		/*
+		 * started_after() subtracts dd->fifo_expire[data_dir] from
+		 * rq->fifo_time. Hence, add it here. Subtract
+		 * dd->prio_aging_expire to ensure that AT HEAD requests are
+		 * submitted before higher priority requests.
+		 */
+		rq->fifo_time = jiffies + dd->fifo_expire[data_dir] -
+				dd->prio_aging_expire;
  	} else {
  		deadline_add_rq_rb(per_prio, rq);

Thanks,

Bart.

Re: [PATCH v2] block/mq-deadline: adjust the timeout period of the per_prio->dispatch

Posted by Damien Le Moal 4 months ago

On 2025/10/10 1:50, Bart Van Assche wrote:
> On 10/9/25 8:52 AM, chengkaitao wrote:
>> On the other hand, the Commit (725f22a1477c) merges the effects of
>> fifo_expire and prio_aging_expire on the same code behavior, creating
>> redundant interactions. To address this, our new patch introduces
>> numerical compensation for {dd->fifo_expire[data_dir]} when adding
>> requests to dispatch lists. To maintain original logic as much as
>> possible while enhancing dispatch list priority, we additionally
>> subtract {dd->prio_aging_expire / 2} from the fifo_time, with default
>> values, {dd->prio_aging_expire / 2} equals {dd->fifo_expire[DD_WRITE]}.
> 
> No assumptions should be made about the relative values of
> dd->prio_aging_expire and dd->fifo_expire[DD_WRITE] since these values
> can be modified via sysfs.
> 
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index 3e741d33142d..fedc66187150 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -659,7 +659,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>   
>>   	if (flags & BLK_MQ_INSERT_AT_HEAD) {
>>   		list_add(&rq->queuelist, &per_prio->dispatch);
>> -		rq->fifo_time = jiffies;
>> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir]
>> +				- dd->prio_aging_expire / 2;
>>   	} else {
>>   		deadline_add_rq_rb(per_prio, rq);
> 
> Thanks for having added a detailed patch description. Please remove
> "/ 2" from the above patch to make sure that BLK_MQ_INSERT_AT_HEAD
> requests are submitted to the block driver before other requests. This
> is important if a request is requeued. Additionally, a comment should be
> added above the modified line of code that explains the purpose of the
> calculation. How about this untested patch?
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 3e741d33142d..566646591ddd 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -659,7 +659,14 @@ static void dd_insert_request(struct blk_mq_hw_ctx 
> *hctx, struct request *rq,
> 
>   	if (flags & BLK_MQ_INSERT_AT_HEAD) {
>   		list_add(&rq->queuelist, &per_prio->dispatch);
> -		rq->fifo_time = jiffies;
> +		/*
> +		 * started_after() subtracts dd->fifo_expire[data_dir] from
> +		 * rq->fifo_time. Hence, add it here. Subtract
> +		 * dd->prio_aging_expire to ensure that AT HEAD requests are
> +		 * submitted before higher priority requests.
> +		 */
> +		rq->fifo_time = jiffies + dd->fifo_expire[data_dir] -
> +				dd->prio_aging_expire;

There is still something bothering me with this: the request is added to the
dispatch list, and *NOT* to the fifo/sort list. So this should be considered as
a scheduling decision in itself, and __dd_dispatch_request() reflects that as
the first thing it does is pick the requests that are in the dispatch list
already. However, __dd_dispatch_request() also has the check:

		if (started_after(dd, rq, latest_start))
                        return NULL;

for requests that are already in the dispatch list. That is what does not make
sense to me. Why ? There is no comment describing this. And I do not understand
why we should bother with any time for requests that are in the dispatch list
already. These should be sent to the drive first, always.

This patch seems to be fixing a problem that is introduced by the above check.
But why this check ? What am I missing here ?

>   	} else {
>   		deadline_add_rq_rb(per_prio, rq);
> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

Re: [PATCH v2] block/mq-deadline: adjust the timeout period of the per_prio->dispatch

Posted by Bart Van Assche 4 months ago

On 10/9/25 1:21 PM, Damien Le Moal wrote:
> There is still something bothering me with this: the request is added to the
> dispatch list, and *NOT* to the fifo/sort list. So this should be considered as
> a scheduling decision in itself, and __dd_dispatch_request() reflects that as
> the first thing it does is pick the requests that are in the dispatch list
> already. However, __dd_dispatch_request() also has the check:
> 
> 		if (started_after(dd, rq, latest_start))
>                          return NULL;
> 
> for requests that are already in the dispatch list. That is what does not make
> sense to me. Why ? There is no comment describing this. And I do not understand
> why we should bother with any time for requests that are in the dispatch list
> already. These should be sent to the drive first, always.
> 
> This patch seems to be fixing a problem that is introduced by the above check.
> But why this check ? What am I missing here ?

Is my conclusion from the above correct that there is agreement that the 
I/O priority should be ignored for AT HEAD requests and that AT HEAD
requests should always be dispatched first? If so, how about merging the
three per I/O priority dispatch lists into a single dispatch list and
not to call started_after() at all for the dispatch list?

Thanks,

Bart.

Re: [PATCH v2] block/mq-deadline: adjust the timeout period of the per_prio->dispatch

Posted by Damien Le Moal 4 months ago

On 2025/10/10 8:40, Bart Van Assche wrote:
> 
> On 10/9/25 1:21 PM, Damien Le Moal wrote:
>> There is still something bothering me with this: the request is added to the
>> dispatch list, and *NOT* to the fifo/sort list. So this should be considered as
>> a scheduling decision in itself, and __dd_dispatch_request() reflects that as
>> the first thing it does is pick the requests that are in the dispatch list
>> already. However, __dd_dispatch_request() also has the check:
>>
>> 		if (started_after(dd, rq, latest_start))
>>                          return NULL;
>>
>> for requests that are already in the dispatch list. That is what does not make
>> sense to me. Why ? There is no comment describing this. And I do not understand
>> why we should bother with any time for requests that are in the dispatch list
>> already. These should be sent to the drive first, always.
>>
>> This patch seems to be fixing a problem that is introduced by the above check.
>> But why this check ? What am I missing here ?
> 
> Is my conclusion from the above correct that there is agreement that the 
> I/O priority should be ignored for AT HEAD requests and that AT HEAD
> requests should always be dispatched first? If so, how about merging the
> three per I/O priority dispatch lists into a single dispatch list and
> not to call started_after() at all for the dispatch list?

More generally speaking, I think we really should have a single dispatch list:
the scheduling policy (per prio with aging, and either lba or fifo ordered)
decides on what request should be sent by adding them to the dispatch list. If
we reschedule again requests that are in the dispatch list, we end up with a
very complicated and hard to debug scheduling policy. Kaitao's patch is proof of
that...

And for the at-head requests, it is indeed debatable what to do. We could be a
little intelligent about them and do a sorted insert in the dispatch list based
on the requests priorities so that we do not have priority inversions. But that
feels like an optimization. I think that the most common case for at-head
insertion is (rare) requeue, no ? Have not checked.

So I feel like Kaitao's problem can be fixed exactly like you said: use a single
dispatch queue and drop the started_after() check in __dd_dispatch_request() for
the dispatch queue.

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH v2] block/mq-deadline: adjust the timeout period of the per_prio->dispatch

Posted by Tao pilgrim 4 months ago

On Fri, Oct 10, 2025 at 7:40 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
>
> On 10/9/25 1:21 PM, Damien Le Moal wrote:
> > There is still something bothering me with this: the request is added to the
> > dispatch list, and *NOT* to the fifo/sort list. So this should be considered as
> > a scheduling decision in itself, and __dd_dispatch_request() reflects that as
> > the first thing it does is pick the requests that are in the dispatch list
> > already. However, __dd_dispatch_request() also has the check:
> >
> >               if (started_after(dd, rq, latest_start))
> >                          return NULL;
> >
> > for requests that are already in the dispatch list. That is what does not make
> > sense to me. Why ? There is no comment describing this. And I do not understand
> > why we should bother with any time for requests that are in the dispatch list
> > already. These should be sent to the drive first, always.
> >
> > This patch seems to be fixing a problem that is introduced by the above check.
> > But why this check ? What am I missing here ?
>
> Is my conclusion from the above correct that there is agreement that the
> I/O priority should be ignored for AT HEAD requests and that AT HEAD
> requests should always be dispatched first? If so, how about merging the
> three per I/O priority dispatch lists into a single dispatch list and
> not to call started_after() at all for the dispatch list?

From the current kernel strategy perspective, the dispatch list does
not fundamentally
differ from the fifo/sort list. It essentially treats AT HEAD requests
as higher-priority
requests compared to regular ones, storing them separately in the
dispatch list. When
I/O scheduling is triggered, the kernel's priority consideration
follows this hierarchy:
     (rt/be/idle priorities) > (expired requests in dispatch list)
      > (expired requests in fifo/sort list) > (non-expired requests
in dispatch list)
      > (non-expired requests in fifo/sort list).

I speculate the original design intent addresses scenarios where
requests accumulate
in the dispatch list. The kernel must first scan for expired requests
within the dispatch
list for prioritized processing, while also positioning expired
requests from the fifo/sort
list higher in priority than non-expired requests in the dispatch list.

To merge the three per I/O priority dispatch lists into a single
dispatch list, prerequisite
conditions must be met: ensuring no request backlog occurs in the
dispatch list that
could cause starvation or frequent deadline violations for requests in
the fifo/sort lists.

-- 
Yours,
Kaitao Cheng