From: Yu Kuai <yukuai3@huawei.com>
Currently, split bio will be chained to original bio, and original bio
will be resubmitted to the tail of current->bio_list, waiting for
split bio to be issued. However, if split bio get split again, the IO
order will be messed up, for example, in raid456 IO will first be
split by max_sector from md_submit_bio(), and then later be split
again by chunksize for internal handling:
For example, assume max_sectors is 1M, and chunksize is 512k
1) issue a 2M IO:
bio issuing: 0+2M
current->bio_list: NULL
2) md_submit_bio() split by max_sector:
bio issuing: 0+1M
current->bio_list: 1M+1M
3) chunk_aligned_read() split by chunksize:
bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k
4) after first bio issued, __submit_bio_noacct() will contuine issuing
next bio:
bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k
5) chunk_aligned_read() split by chunksize:
bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k
6) no split afterwards, finally the issue order is:
0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
This behaviour will cause large IO read on raid456 endup to be small
discontinuous IO in underlying disks. Fix this problem by placing split
bio to the head of current->bio_list.
Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
Test results:
Before this patch
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10
sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20
2) blktrace G stage:
8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))
Read|Write seek
cnt 55175, zero cnt 25079
>=(KB) .. <(KB) : count ratio |distribution |
0 .. 1 : 25079 45.5% |########################################|
1 .. 2 : 0 0.0% | |
2 .. 4 : 0 0.0% | |
4 .. 8 : 0 0.0% | |
8 .. 16 : 0 0.0% | |
16 .. 32 : 0 0.0% | |
32 .. 64 : 12540 22.7% |##################### |
64 .. 128 : 2508 4.5% |##### |
128 .. 256 : 0 0.0% | |
256 .. 512 : 10032 18.2% |################# |
512 .. 1024 : 5016 9.1% |######### |
After this patch:
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60
sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50
2) blktrace G stage:
8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
>=(KB) .. <(KB) : count ratio |distribution |
0 .. 1 : 21483 75.0% |########################################|
1 .. 2 : 0 0.0% | |
2 .. 4 : 0 0.0% | |
4 .. 8 : 0 0.0% | |
8 .. 16 : 0 0.0% | |
16 .. 32 : 0 0.0% | |
32 .. 64 : 7161 25.0% |############## |
BTW, this looks like a long term problem from day one, and large
sequential IO read is pretty common case like video playing.
And even with this patch, in this test case IO is merged to at most 128k
is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
limit can get even better performance. However, we'll figure out how to do
this properly later.
Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-core.c | 21 ++++++++++++++-------
block/blk-throttle.c | 2 +-
block/blk.h | 2 +-
3 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 37836446f365..b643d3b1e9fe 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio)
current->bio_list = NULL;
}
-void submit_bio_noacct_nocheck(struct bio *bio)
+void submit_bio_noacct_nocheck(struct bio *bio, bool split)
{
blk_cgroup_bio_start(bio);
blkcg_bio_issue_init(bio);
@@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
* to collect a list of requests submited by a ->submit_bio method while
* it is active, and then process them after it returned.
*/
- if (current->bio_list)
- bio_list_add(¤t->bio_list[0], bio);
- else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+ if (current->bio_list) {
+ if (split)
+ bio_list_add_head(¤t->bio_list[0], bio);
+ else
+ bio_list_add(¤t->bio_list[0], bio);
+ } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
__submit_bio_noacct_mq(bio);
- else
+ } else {
__submit_bio_noacct(bio);
+ }
}
static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
@@ -769,6 +773,9 @@ void submit_split_bio_noacct(struct bio *bio)
{
might_sleep();
+ /* This helper should only be called from submit_bio context */
+ WARN_ON_ONCE(!current->bio_list);
+
if (should_fail_bio(bio)) {
bio_io_error(bio);
return;
@@ -777,7 +784,7 @@ void submit_split_bio_noacct(struct bio *bio)
if (blk_throtl_bio(bio))
return;
- submit_bio_noacct_nocheck(bio);
+ submit_bio_noacct_nocheck(bio, true);
}
/**
@@ -886,7 +893,7 @@ void submit_bio_noacct(struct bio *bio)
if (blk_throtl_bio(bio))
return;
- submit_bio_noacct_nocheck(bio);
+ submit_bio_noacct_nocheck(bio, false);
return;
not_supported:
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 397b6a410f9e..ead7b0eb4846 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1224,7 +1224,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
if (!bio_list_empty(&bio_list_on_stack)) {
blk_start_plug(&plug);
while ((bio = bio_list_pop(&bio_list_on_stack)))
- submit_bio_noacct_nocheck(bio);
+ submit_bio_noacct_nocheck(bio, false);
blk_finish_plug(&plug);
}
}
diff --git a/block/blk.h b/block/blk.h
index 80375374ef55..da120c74c4b5 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -55,7 +55,7 @@ bool __blk_freeze_queue_start(struct request_queue *q,
struct task_struct *owner);
int __bio_queue_enter(struct request_queue *q, struct bio *bio);
void submit_split_bio_noacct(struct bio *bio);
-void submit_bio_noacct_nocheck(struct bio *bio);
+void submit_bio_noacct_nocheck(struct bio *bio, bool split);
void bio_await_chain(struct bio *bio);
static inline bool blk_try_enter_queue(struct request_queue *q, bool pm)
--
2.39.2
On 8/28/25 15:57, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > Currently, split bio will be chained to original bio, and original bio > will be resubmitted to the tail of current->bio_list, waiting for > split bio to be issued. However, if split bio get split again, the IO > order will be messed up, for example, in raid456 IO will first be > split by max_sector from md_submit_bio(), and then later be split > again by chunksize for internal handling: > > For example, assume max_sectors is 1M, and chunksize is 512k > > 1) issue a 2M IO: > > bio issuing: 0+2M > current->bio_list: NULL > > 2) md_submit_bio() split by max_sector: > > bio issuing: 0+1M > current->bio_list: 1M+1M > > 3) chunk_aligned_read() split by chunksize: > > bio issuing: 0+512k > current->bio_list: 1M+1M -> 512k+512k > > 4) after first bio issued, __submit_bio_noacct() will contuine issuing > next bio: > > bio issuing: 1M+1M > current->bio_list: 512k+512k > bio issued: 0+512k > > 5) chunk_aligned_read() split by chunksize: > > bio issuing: 1M+512k > current->bio_list: 512k+512k -> 1536k+512k > bio issued: 0+512k > > 6) no split afterwards, finally the issue order is: > > 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k > > This behaviour will cause large IO read on raid456 endup to be small > discontinuous IO in underlying disks. Fix this problem by placing split > bio to the head of current->bio_list. > > Test script: test on 8 disk raid5 with 64k chunksize > dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct > > Test results: > Before this patch > 1) iostat results: > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util > md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10 > sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20 > 2) blktrace G stage: > 8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd] > 8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd] > 8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd] > 8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd] > 8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd] > 8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd] > 8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd] > 3) io seek for sdx: > Noted io seek is the result from blktrace D stage, statistic of: > ABS((offset of next IO) - (offset + len of previous IO)) > > Read|Write seek > cnt 55175, zero cnt 25079 > >=(KB) .. <(KB) : count ratio |distribution | > 0 .. 1 : 25079 45.5% |########################################| > 1 .. 2 : 0 0.0% | | > 2 .. 4 : 0 0.0% | | > 4 .. 8 : 0 0.0% | | > 8 .. 16 : 0 0.0% | | > 16 .. 32 : 0 0.0% | | > 32 .. 64 : 12540 22.7% |##################### | > 64 .. 128 : 2508 4.5% |##### | > 128 .. 256 : 0 0.0% | | > 256 .. 512 : 10032 18.2% |################# | > 512 .. 1024 : 5016 9.1% |######### | > > After this patch: > 1) iostat results: > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util > md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60 > sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50 > 2) blktrace G stage: > 8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd] > 8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd] > 8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd] > 8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd] > 8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd] > 8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd] > 8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd] > 8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd] > 3) io seek for sdx > Read|Write seek > cnt 28644, zero cnt 21483 > >=(KB) .. <(KB) : count ratio |distribution | > 0 .. 1 : 21483 75.0% |########################################| > 1 .. 2 : 0 0.0% | | > 2 .. 4 : 0 0.0% | | > 4 .. 8 : 0 0.0% | | > 8 .. 16 : 0 0.0% | | > 16 .. 32 : 0 0.0% | | > 32 .. 64 : 7161 25.0% |############## | > > BTW, this looks like a long term problem from day one, and large > sequential IO read is pretty common case like video playing. > > And even with this patch, in this test case IO is merged to at most 128k > is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such > limit can get even better performance. However, we'll figure out how to do > this properly later. > > Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls") > Reported-by: Tie Ren <tieren@fnnas.com> > Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/ > Signed-off-by: Yu Kuai <yukuai3@huawei.com> > --- > block/blk-core.c | 21 ++++++++++++++------- > block/blk-throttle.c | 2 +- > block/blk.h | 2 +- > 3 files changed, 16 insertions(+), 9 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index 37836446f365..b643d3b1e9fe 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio) > current->bio_list = NULL; > } > > -void submit_bio_noacct_nocheck(struct bio *bio) > +void submit_bio_noacct_nocheck(struct bio *bio, bool split) > { > blk_cgroup_bio_start(bio); > blkcg_bio_issue_init(bio); > @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio) > * to collect a list of requests submited by a ->submit_bio method while > * it is active, and then process them after it returned. > */ > - if (current->bio_list) > - bio_list_add(¤t->bio_list[0], bio); > - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) > + if (current->bio_list) { > + if (split) > + bio_list_add_head(¤t->bio_list[0], bio); > + else > + bio_list_add(¤t->bio_list[0], bio); This really needs a comment clarifying why we do an add at tail instead of keeping the original order with a add at head. I am also scared that this may break sequential write ordering for zoned devices. > + } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { > __submit_bio_noacct_mq(bio); > - else > + } else { > __submit_bio_noacct(bio); > + } > } -- Damien Le Moal Western Digital Research
Hi, 在 2025/8/30 9:02, Damien Le Moal 写道: > On 8/28/25 15:57, Yu Kuai wrote: >> From: Yu Kuai <yukuai3@huawei.com> >> >> Currently, split bio will be chained to original bio, and original bio >> will be resubmitted to the tail of current->bio_list, waiting for >> split bio to be issued. However, if split bio get split again, the IO >> order will be messed up, for example, in raid456 IO will first be >> split by max_sector from md_submit_bio(), and then later be split >> again by chunksize for internal handling: >> >> For example, assume max_sectors is 1M, and chunksize is 512k >> >> 1) issue a 2M IO: >> >> bio issuing: 0+2M >> current->bio_list: NULL >> >> 2) md_submit_bio() split by max_sector: >> >> bio issuing: 0+1M >> current->bio_list: 1M+1M >> >> 3) chunk_aligned_read() split by chunksize: >> >> bio issuing: 0+512k >> current->bio_list: 1M+1M -> 512k+512k >> >> 4) after first bio issued, __submit_bio_noacct() will contuine issuing >> next bio: >> >> bio issuing: 1M+1M >> current->bio_list: 512k+512k >> bio issued: 0+512k >> >> 5) chunk_aligned_read() split by chunksize: >> >> bio issuing: 1M+512k >> current->bio_list: 512k+512k -> 1536k+512k >> bio issued: 0+512k >> >> 6) no split afterwards, finally the issue order is: >> >> 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k >> >> This behaviour will cause large IO read on raid456 endup to be small >> discontinuous IO in underlying disks. Fix this problem by placing split >> bio to the head of current->bio_list. >> >> Test script: test on 8 disk raid5 with 64k chunksize >> dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct >> >> Test results: >> Before this patch >> 1) iostat results: >> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util >> md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10 >> sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20 >> 2) blktrace G stage: >> 8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd] >> 8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd] >> 8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd] >> 8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd] >> 8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd] >> 8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd] >> 8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd] >> 3) io seek for sdx: >> Noted io seek is the result from blktrace D stage, statistic of: >> ABS((offset of next IO) - (offset + len of previous IO)) >> >> Read|Write seek >> cnt 55175, zero cnt 25079 >> >=(KB) .. <(KB) : count ratio |distribution | >> 0 .. 1 : 25079 45.5% |########################################| >> 1 .. 2 : 0 0.0% | | >> 2 .. 4 : 0 0.0% | | >> 4 .. 8 : 0 0.0% | | >> 8 .. 16 : 0 0.0% | | >> 16 .. 32 : 0 0.0% | | >> 32 .. 64 : 12540 22.7% |##################### | >> 64 .. 128 : 2508 4.5% |##### | >> 128 .. 256 : 0 0.0% | | >> 256 .. 512 : 10032 18.2% |################# | >> 512 .. 1024 : 5016 9.1% |######### | >> >> After this patch: >> 1) iostat results: >> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util >> md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60 >> sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50 >> 2) blktrace G stage: >> 8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd] >> 8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd] >> 8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd] >> 8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd] >> 8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd] >> 8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd] >> 8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd] >> 8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd] >> 3) io seek for sdx >> Read|Write seek >> cnt 28644, zero cnt 21483 >> >=(KB) .. <(KB) : count ratio |distribution | >> 0 .. 1 : 21483 75.0% |########################################| >> 1 .. 2 : 0 0.0% | | >> 2 .. 4 : 0 0.0% | | >> 4 .. 8 : 0 0.0% | | >> 8 .. 16 : 0 0.0% | | >> 16 .. 32 : 0 0.0% | | >> 32 .. 64 : 7161 25.0% |############## | >> >> BTW, this looks like a long term problem from day one, and large >> sequential IO read is pretty common case like video playing. >> >> And even with this patch, in this test case IO is merged to at most 128k >> is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such >> limit can get even better performance. However, we'll figure out how to do >> this properly later. >> >> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls") >> Reported-by: Tie Ren <tieren@fnnas.com> >> Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/ >> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >> --- >> block/blk-core.c | 21 ++++++++++++++------- >> block/blk-throttle.c | 2 +- >> block/blk.h | 2 +- >> 3 files changed, 16 insertions(+), 9 deletions(-) >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 37836446f365..b643d3b1e9fe 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio) >> current->bio_list = NULL; >> } >> >> -void submit_bio_noacct_nocheck(struct bio *bio) >> +void submit_bio_noacct_nocheck(struct bio *bio, bool split) >> { >> blk_cgroup_bio_start(bio); >> blkcg_bio_issue_init(bio); >> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio) >> * to collect a list of requests submited by a ->submit_bio method while >> * it is active, and then process them after it returned. >> */ >> - if (current->bio_list) >> - bio_list_add(¤t->bio_list[0], bio); >> - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) >> + if (current->bio_list) { >> + if (split) >> + bio_list_add_head(¤t->bio_list[0], bio); >> + else >> + bio_list_add(¤t->bio_list[0], bio); > This really needs a comment clarifying why we do an add at tail instead of > keeping the original order with a add at head. I am also scared that this may > break sequential write ordering for zoned devices. I think add at head is exactly what we do here to keep the orginal order for the case bio split. Other than split, if caller do generate multiple sequential bios, we should keep the order by add at tail. Not sure about zoned devices for now, I'll have a look in details. Thanks, Kuai > >> + } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { >> __submit_bio_noacct_mq(bio); >> - else >> + } else { >> __submit_bio_noacct(bio); >> + } >> } > >
Hi, 在 2025/08/30 12:28, Yu Kuai 写道: >>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio) >>> * to collect a list of requests submited by a ->submit_bio >>> method while >>> * it is active, and then process them after it returned. >>> */ >>> - if (current->bio_list) >>> - bio_list_add(¤t->bio_list[0], bio); >>> - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) >>> + if (current->bio_list) { >>> + if (split) >>> + bio_list_add_head(¤t->bio_list[0], bio); >>> + else >>> + bio_list_add(¤t->bio_list[0], bio); >> This really needs a comment clarifying why we do an add at tail >> instead of >> keeping the original order with a add at head. I am also scared that >> this may >> break sequential write ordering for zoned devices. > > I think add at head is exactly what we do here to keep the orginal order > for > the case bio split. Other than split, if caller do generate multiple > sequential > bios, we should keep the order by add at tail. > > Not sure about zoned devices for now, I'll have a look in details. For zoned devices, can we somehow trigger this recursive split? I suspect bio disordered will apear in this case but I don't know for now and I can't find a way to reporduce it. Perhaps I can bypass zoned devices for now, and if we really met the recursive split case and there is a problem, we can fix it later: if (split && !bdev_is_zoned(bio->bi_bdev)) bio_list_add_head() Thanks, Kuai
On 9/1/25 11:40 AM, Yu Kuai wrote: > Hi, > > 在 2025/08/30 12:28, Yu Kuai 写道: >>>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio) >>>> * to collect a list of requests submited by a ->submit_bio method while >>>> * it is active, and then process them after it returned. >>>> */ >>>> - if (current->bio_list) >>>> - bio_list_add(¤t->bio_list[0], bio); >>>> - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) >>>> + if (current->bio_list) { >>>> + if (split) >>>> + bio_list_add_head(¤t->bio_list[0], bio); >>>> + else >>>> + bio_list_add(¤t->bio_list[0], bio); >>> This really needs a comment clarifying why we do an add at tail instead of >>> keeping the original order with a add at head. I am also scared that this may >>> break sequential write ordering for zoned devices. >> >> I think add at head is exactly what we do here to keep the orginal order for >> the case bio split. Other than split, if caller do generate multiple sequential >> bios, we should keep the order by add at tail. >> >> Not sure about zoned devices for now, I'll have a look in details. > > For zoned devices, can we somehow trigger this recursive split? I > suspect bio disordered will apear in this case but I don't know for > now and I can't find a way to reporduce it. dm-linear can be stacked on e.g. dm-crypt, or the reverse. So recursive splitting may be possible. Though since for DM everything is zone aligned, it may be hard to find a reproducer. Though dm-crypt will always split BIOs to BIO_MAX_VECS << PAGE_SECTORS_SHIFT sectors, so it may be possible with very large BIOs. Would need to try, but really overloaded with other things right now. > > Perhaps I can bypass zoned devices for now, and if we really met the > recursive split case and there is a problem, we can fix it later: > > if (split && !bdev_is_zoned(bio->bi_bdev)) > bio_list_add_head() > > Thanks, > Kuai > -- Damien Le Moal Western Digital Research
© 2016 - 2025 Red Hat, Inc.