[PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split

Yu Kuai posted 15 patches 1 month ago
[PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Yu Kuai 1 month ago
From: Yu Kuai <yukuai3@huawei.com>

Currently, split bio will be chained to original bio, and original bio
will be resubmitted to the tail of current->bio_list, waiting for
split bio to be issued. However, if split bio get split again, the IO
order will be messed up, for example, in raid456 IO will first be
split by max_sector from md_submit_bio(), and then later be split
again by chunksize for internal handling:

For example, assume max_sectors is 1M, and chunksize is 512k

1) issue a 2M IO:

bio issuing: 0+2M
current->bio_list: NULL

2) md_submit_bio() split by max_sector:

bio issuing: 0+1M
current->bio_list: 1M+1M

3) chunk_aligned_read() split by chunksize:

bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k

4) after first bio issued, __submit_bio_noacct() will contuine issuing
next bio:

bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k

5) chunk_aligned_read() split by chunksize:

bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k

6) no split afterwards, finally the issue order is:

0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour will cause large IO read on raid456 endup to be small
discontinuous IO in underlying disks. Fix this problem by placing split
bio to the head of current->bio_list.

Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:
Before this patch
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
2) blktrace G stage:
  8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))

Read|Write seek
cnt 55175, zero cnt 25079
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 25079       45.5% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 12540       22.7% |#####################                   |
        64 .. 128       : 2508         4.5% |#####                                   |
       128 .. 256       : 0            0.0% |                                        |
       256 .. 512       : 10032       18.2% |#################                       |
       512 .. 1024      : 5016         9.1% |#########                               |

After this patch:
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
2) blktrace G stage:
  8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 21483       75.0% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 7161        25.0% |##############                          |

BTW, this looks like a long term problem from day one, and large
sequential IO read is pretty common case like video playing.

And even with this patch, in this test case IO is merged to at most 128k
is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
limit can get even better performance. However, we'll figure out how to do
this properly later.

Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-core.c     | 21 ++++++++++++++-------
 block/blk-throttle.c |  2 +-
 block/blk.h          |  2 +-
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ea194a1a5b2c..6ca3c45f421c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio)
 	current->bio_list = NULL;
 }
 
-void submit_bio_noacct_nocheck(struct bio *bio)
+void submit_bio_noacct_nocheck(struct bio *bio, bool split)
 {
 	blk_cgroup_bio_start(bio);
 	blkcg_bio_issue_init(bio);
@@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
 	 * to collect a list of requests submited by a ->submit_bio method while
 	 * it is active, and then process them after it returned.
 	 */
-	if (current->bio_list)
-		bio_list_add(&current->bio_list[0], bio);
-	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+	if (current->bio_list) {
+		if (split && !bdev_is_zoned(bio->bi_bdev))
+			bio_list_add_head(&current->bio_list[0], bio);
+		else
+			bio_list_add(&current->bio_list[0], bio);
+	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		__submit_bio_noacct_mq(bio);
-	else
+	} else {
 		__submit_bio_noacct(bio);
+	}
 }
 
 static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
@@ -770,6 +774,9 @@ void submit_split_bio_noacct(struct bio *bio)
 {
 	might_sleep();
 
+	/* This helper should only be called from submit_bio context */
+	WARN_ON_ONCE(!current->bio_list);
+
 	if (should_fail_bio(bio)) {
 		bio_io_error(bio);
 		return;
@@ -778,7 +785,7 @@ void submit_split_bio_noacct(struct bio *bio)
 	if (blk_throtl_bio(bio))
 		return;
 
-	submit_bio_noacct_nocheck(bio);
+	submit_bio_noacct_nocheck(bio, true);
 }
 
 /**
@@ -887,7 +894,7 @@ void submit_bio_noacct(struct bio *bio)
 
 	if (blk_throtl_bio(bio))
 		return;
-	submit_bio_noacct_nocheck(bio);
+	submit_bio_noacct_nocheck(bio, false);
 	return;
 
 not_supported:
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 397b6a410f9e..ead7b0eb4846 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1224,7 +1224,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 	if (!bio_list_empty(&bio_list_on_stack)) {
 		blk_start_plug(&plug);
 		while ((bio = bio_list_pop(&bio_list_on_stack)))
-			submit_bio_noacct_nocheck(bio);
+			submit_bio_noacct_nocheck(bio, false);
 		blk_finish_plug(&plug);
 	}
 }
diff --git a/block/blk.h b/block/blk.h
index 68bf637ab7ca..a7207eea7a84 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -55,7 +55,7 @@ bool __blk_freeze_queue_start(struct request_queue *q,
 			      struct task_struct *owner);
 int __bio_queue_enter(struct request_queue *q, struct bio *bio);
 void submit_split_bio_noacct(struct bio *bio);
-void submit_bio_noacct_nocheck(struct bio *bio);
+void submit_bio_noacct_nocheck(struct bio *bio, bool split);
 void bio_await_chain(struct bio *bio);
 
 static inline bool blk_try_enter_queue(struct request_queue *q, bool pm)
-- 
2.39.2
Re: [PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Christoph Hellwig 4 weeks, 1 day ago
Btw disordered IO sounds a bit odd to me.  I'm not a native english
sepaker, but in the past we've used the term "I/O reordering" for
issues like this.

> +		if (split && !bdev_is_zoned(bio->bi_bdev))

Why are zoned devices special cased here?
Re: [PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Yu Kuai 4 weeks, 1 day ago
Hi,

在 2025/9/3 21:34, Christoph Hellwig 写道:
> Btw disordered IO sounds a bit odd to me.  I'm not a native english
> sepaker, but in the past we've used the term "I/O reordering" for
> issues like this.
>
>> +		if (split && !bdev_is_zoned(bio->bi_bdev))
> Why are zoned devices special cased here?
>
I'm not that sure about zoned device before, I'll remove this checking,
and mention the problem Bart met in commit message in the next version.

Thanks,
Kuai

Re: [PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Bart Van Assche 4 weeks, 1 day ago
On 9/3/25 9:59 AM, Yu Kuai wrote:
> 在 2025/9/3 21:34, Christoph Hellwig 写道:
>>> +        if (split && !bdev_is_zoned(bio->bi_bdev))
 >>
>> Why are zoned devices special cased here?
>>
> I'm not that sure about zoned device before, I'll remove this checking,
> and mention the problem Bart met in commit message in the next version.

If I keep the bdev_is_zoned() check, test zbd/014 fails. If I remove it,
test zbd/014 passes. I think that's a strong argument to remove that
check. Test zbd/014 is not yet in the blktests repository but is
available here:
https://lore.kernel.org/linux-block/a8a714c7-de3d-4cc9-8c23-38b8dc06f5bb@acm.org/

Thanks,

Bart.
Re: [PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Bart Van Assche 1 month ago
On 8/31/25 8:32 PM, Yu Kuai wrote:
> -void submit_bio_noacct_nocheck(struct bio *bio)
> +void submit_bio_noacct_nocheck(struct bio *bio, bool split)
>   {
>   	blk_cgroup_bio_start(bio);
>   	blkcg_bio_issue_init(bio);
> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>   	 * to collect a list of requests submited by a ->submit_bio method while
>   	 * it is active, and then process them after it returned.
>   	 */
> -	if (current->bio_list)
> -		bio_list_add(&current->bio_list[0], bio);
> -	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
> +	if (current->bio_list) {
> +		if (split && !bdev_is_zoned(bio->bi_bdev))
> +			bio_list_add_head(&current->bio_list[0], bio);
> +		else
> +			bio_list_add(&current->bio_list[0], bio);

The above change will cause write errors for zoned block devices. As I
have shown before, also for zoned block devices, if a bio is split
insertion must happen at the head of the list. See e.g.
"Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio 
submission order"
(https://lore.kernel.org/linux-block/a0c89df8-4b33-409c-ba43-f9543fb1b091@acm.org/)

Thanks,

Bart.
Re: [PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Yu Kuai 1 month ago
Hi,

在 2025/09/03 1:20, Bart Van Assche 写道:
> On 8/31/25 8:32 PM, Yu Kuai wrote:
>> -void submit_bio_noacct_nocheck(struct bio *bio)
>> +void submit_bio_noacct_nocheck(struct bio *bio, bool split)
>>   {
>>       blk_cgroup_bio_start(bio);
>>       blkcg_bio_issue_init(bio);
>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>>        * to collect a list of requests submited by a ->submit_bio 
>> method while
>>        * it is active, and then process them after it returned.
>>        */
>> -    if (current->bio_list)
>> -        bio_list_add(&current->bio_list[0], bio);
>> -    else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
>> +    if (current->bio_list) {
>> +        if (split && !bdev_is_zoned(bio->bi_bdev))
>> +            bio_list_add_head(&current->bio_list[0], bio);
>> +        else
>> +            bio_list_add(&current->bio_list[0], bio);
> 
> The above change will cause write errors for zoned block devices. As I
> have shown before, also for zoned block devices, if a bio is split
> insertion must happen at the head of the list. See e.g.
> "Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio 
> submission order"
> (https://lore.kernel.org/linux-block/a0c89df8-4b33-409c-ba43-f9543fb1b091@acm.org/) 
> 

Do you mean we should remove the bdev_is_zoned() checking? I added this
checking because I'm not quite sure about details in zone device, and
this checking is aimed at prevent functional changes in zone device.

So I don't think this change will cause write errors, the write errors
should already exist before this set, right?

Thanks,
Kuai

> 
> Thanks,
> 
> Bart.
> .
> 

Re: [PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Bart Van Assche 1 month ago
On 9/2/25 6:00 PM, Yu Kuai wrote:
> Hi,
> 
> 在 2025/09/03 1:20, Bart Van Assche 写道:
>> On 8/31/25 8:32 PM, Yu Kuai wrote:
>>> -void submit_bio_noacct_nocheck(struct bio *bio)
>>> +void submit_bio_noacct_nocheck(struct bio *bio, bool split)
>>>   {
>>>       blk_cgroup_bio_start(bio);
>>>       blkcg_bio_issue_init(bio);
>>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>>>        * to collect a list of requests submited by a ->submit_bio 
>>> method while
>>>        * it is active, and then process them after it returned.
>>>        */
>>> -    if (current->bio_list)
>>> -        bio_list_add(&current->bio_list[0], bio);
>>> -    else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
>>> +    if (current->bio_list) {
>>> +        if (split && !bdev_is_zoned(bio->bi_bdev))
>>> +            bio_list_add_head(&current->bio_list[0], bio);
>>> +        else
>>> +            bio_list_add(&current->bio_list[0], bio);
>>
>> The above change will cause write errors for zoned block devices. As I
>> have shown before, also for zoned block devices, if a bio is split
>> insertion must happen at the head of the list. See e.g.
>> "Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio 
>> submission order"
>> (https://lore.kernel.org/linux-block/a0c89df8-4b33-409c-ba43- 
>> f9543fb1b091@acm.org/)
> 
> Do you mean we should remove the bdev_is_zoned() checking? I added this
> checking because I'm not quite sure about details in zone device, and
> this checking is aimed at prevent functional changes in zone device.

Yes, the bdev_is_zoned() check should be removed. I spent a significant
amount of time on root-causing and proposing fixes for write errors
caused by recursive bio splitting for zoned devices. What I learned by
analyzing these write errors is the basis for this feedback.

> So I don't think this change will cause write errors, the write errors
> should already exist before this set, right?

Agreed. Although I haven't checked this yet, if the bdev_is_zoned()
check is removed from this patch, this patch should fix the write errors
triggered by stacking a dm driver on top of a zoned block device if
inline encryption is enabled.

Thanks,

Bart.
Re: [PATCH RFC v3 14/15] block: fix disordered IO in the case recursive split
Posted by Yu Kuai 1 month ago
Hi,

在 2025/09/03 9:12, Bart Van Assche 写道:
> On 9/2/25 6:00 PM, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/09/03 1:20, Bart Van Assche 写道:
>>> On 8/31/25 8:32 PM, Yu Kuai wrote:
>>>> -void submit_bio_noacct_nocheck(struct bio *bio)
>>>> +void submit_bio_noacct_nocheck(struct bio *bio, bool split)
>>>>   {
>>>>       blk_cgroup_bio_start(bio);
>>>>       blkcg_bio_issue_init(bio);
>>>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>>>>        * to collect a list of requests submited by a ->submit_bio 
>>>> method while
>>>>        * it is active, and then process them after it returned.
>>>>        */
>>>> -    if (current->bio_list)
>>>> -        bio_list_add(&current->bio_list[0], bio);
>>>> -    else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
>>>> +    if (current->bio_list) {
>>>> +        if (split && !bdev_is_zoned(bio->bi_bdev))
>>>> +            bio_list_add_head(&current->bio_list[0], bio);
>>>> +        else
>>>> +            bio_list_add(&current->bio_list[0], bio);
>>>
>>> The above change will cause write errors for zoned block devices. As I
>>> have shown before, also for zoned block devices, if a bio is split
>>> insertion must happen at the head of the list. See e.g.
>>> "Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio 
>>> submission order"
>>> (https://lore.kernel.org/linux-block/a0c89df8-4b33-409c-ba43- 
>>> f9543fb1b091@acm.org/)
>>
>> Do you mean we should remove the bdev_is_zoned() checking? I added this
>> checking because I'm not quite sure about details in zone device, and
>> this checking is aimed at prevent functional changes in zone device.
> 
> Yes, the bdev_is_zoned() check should be removed. I spent a significant
> amount of time on root-causing and proposing fixes for write errors
> caused by recursive bio splitting for zoned devices. What I learned by
> analyzing these write errors is the basis for this feedback.
> 
>> So I don't think this change will cause write errors, the write errors
>> should already exist before this set, right?
> 
> Agreed. Although I haven't checked this yet, if the bdev_is_zoned()
> check is removed from this patch, this patch should fix the write errors
> triggered by stacking a dm driver on top of a zoned block device if
> inline encryption is enabled.
> 

Ok, I'll remove the checking and mention this problem in the next formal
version.

Thanks,
Kuai

> Thanks,
> 
> Bart.
> .
>