From: Yu Kuai
To: axboe@kernel.dk, neil@brown.name, akpm@linux-foundation.org
Cc: linux-block@vger.kernel.org, linux-raid@vger.kernel.org,
    linux-kernel@vger.kernel.org, colyli@kernel.org, xni@redhat.com,
    yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com,
    yangerkun@huawei.com, johnny.chenyi@huawei.com
Subject: [PATCH] block: fix disordered IO in the case of recursive split
Date: Thu, 21 Aug 2025 15:47:06 +0800
Message-Id: <20250821074707.2604640-1-yukuai1@huaweicloud.com>

From: Yu Kuai

Currently, a split bio is chained to the original bio, and the original
bio is resubmitted to the tail of current->bio_list, waiting for the
split bio to be issued. However, if the split bio gets split again, the
IO order is messed up. In raid456, for example, IO is first split by
max_sectors in md_submit_bio(), and then split again by the chunk size
for internal handling.

Assume max_sectors is 1M and the chunk size is 512k:

1) issue a 2M IO:

   bio issuing: 0+2M
   current->bio_list: NULL

2) md_submit_bio() splits by max_sectors:

   bio issuing: 0+1M
   current->bio_list: 1M+1M

3) chunk_aligned_read() splits by the chunk size:

   bio issuing: 0+512k
   current->bio_list: 1M+1M -> 512k+512k

4) after the first bio is issued, __submit_bio_noacct() continues
   issuing the next bio:

   bio issuing: 1M+1M
   current->bio_list: 512k+512k
   bio issued: 0+512k

5) chunk_aligned_read() splits by the chunk size:

   bio issuing: 1M+512k
   current->bio_list: 512k+512k -> 1536k+512k
   bio issued: 0+512k

6) no split afterwards; the final issue order is:

   0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes a large sequential read on raid456 to end up as
small discontinuous IO on the underlying disks.

Fix this problem by placing the chained bio at the head of
current->bio_list instead.
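To make the reordering easy to reproduce outside the kernel, here is a
minimal user-space model (illustrative only, not kernel code; the fixed
queue depth, KB units, and the 1M/512k split limits are assumptions
mirroring the example above):

/*
 * Model current->bio_list as a FIFO of (offset, len) ranges in KB. A
 * "bio" is split first at 1M (max_sectors) and then at 512k (chunk
 * size); the split-off remainder is requeued either at the tail (old
 * behaviour) or at the head (this patch).
 */
#include <stdio.h>
#include <string.h>

struct q { unsigned off[16], len[16]; int n; };

static void push_tail(struct q *q, unsigned off, unsigned len)
{
	q->off[q->n] = off;
	q->len[q->n] = len;
	q->n++;
}

static void push_head(struct q *q, unsigned off, unsigned len)
{
	memmove(&q->off[1], &q->off[0], q->n * sizeof(q->off[0]));
	memmove(&q->len[1], &q->len[0], q->n * sizeof(q->len[0]));
	q->off[0] = off;
	q->len[0] = len;
	q->n++;
}

static void pop_head(struct q *q, unsigned *off, unsigned *len)
{
	*off = q->off[0];
	*len = q->len[0];
	q->n--;
	memmove(&q->off[0], &q->off[1], q->n * sizeof(q->off[0]));
	memmove(&q->len[0], &q->len[1], q->n * sizeof(q->len[0]));
}

static void run(int requeue_at_head)
{
	struct q q = { .n = 0 };
	unsigned off, len;

	push_tail(&q, 0, 2048);			/* one 2M read */
	while (q.n) {
		pop_head(&q, &off, &len);
		/* split at max_sectors (1M) first, then at chunk size (512k) */
		unsigned limit = len > 1024 ? 1024 : 512;

		if (len > limit) {
			/* the remainder is the chained bio being resubmitted */
			if (requeue_at_head)
				push_head(&q, off + limit, len - limit);
			else
				push_tail(&q, off + limit, len - limit);
			len = limit;
		}
		if (len == 512)			/* fully split: issued */
			printf(" %uk+%uk", off, len);
		else				/* front half descends further */
			push_head(&q, off, len);
	}
	printf("\n");
}

int main(void)
{
	printf("requeue at tail (old):");
	run(0);
	printf("requeue at head (new):");
	run(1);
	return 0;
}

Compiled and run, the tail variant prints the disordered sequence from
step 6) above (0k 1024k 512k 1536k), while the head variant prints the
offsets in ascending order.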
Test script (on an 8-disk raid5 with a 64k chunk size):

dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:

Before this patch:

1) iostat results:

Device       r/s    rMB/s   rrqm/s  %rrqm  r_await  rareq-sz  aqu-sz  %util
md0     52430.00  3276.87     0.00   0.00     0.62     64.00   32.60  80.10
sd*      4487.00   409.00  2054.00  31.40     0.82     93.34    3.68  71.20

2) blktrace G stage:

  8,0    0   486445 11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451 11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454 11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468 11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474 11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480 11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490 11.358303858   843  G   R 14071808 + 128 [dd]

3) io seek for sdx:

Note that "io seek" is computed from blktrace D-stage records as
ABS((offset of next IO) - (offset + len of previous IO)); a small
sketch of this computation follows the distribution table below.

Read|Write seek cnt 55175, zero cnt 25079
>=(KB) .. <(KB) : count  ratio  |distribution                            |
     0 ..     1 : 25079  45.5%  |########################################|
     1 ..     2 :     0   0.0%  |                                        |
     2 ..     4 :     0   0.0%  |                                        |
     4 ..     8 :     0   0.0%  |                                        |
     8 ..    16 :     0   0.0%  |                                        |
    16 ..    32 :     0   0.0%  |                                        |
    32 ..    64 : 12540  22.7%  |#####################                   |
    64 ..   128 :  2508   4.5%  |#####                                   |
   128 ..   256 :     0   0.0%  |                                        |
   256 ..   512 : 10032  18.2%  |#################                       |
   512 ..  1024 :  5016   9.1%  |#########                               |
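For reference, the seek statistic could be computed along these lines
(a hedged sketch, not the actual tooling used; the sector offsets and
lengths are copied from the G-stage sample above, and blktrace prints
them in 512-byte sectors):

/*
 * seek = ABS((offset of next IO) - (offset + len of previous IO)).
 * Real use would parse blktrace D-stage output instead of using a
 * hard-coded sample.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	static const long io[][2] = {	/* { sector offset, sectors } */
		{ 14071424, 128 },
		{ 14071168, 128 },
		{ 14071296, 128 },
	};
	int n = sizeof(io) / sizeof(io[0]);

	for (int i = 1; i < n; i++) {
		long seek = labs(io[i][0] - (io[i - 1][0] + io[i - 1][1]));

		printf("seek: %ld sectors (%ld KB)\n", seek, seek / 2);
	}
	return 0;
}

For this sample it prints a 192 KB backward seek followed by a zero
seek, matching the "zero cnt" versus non-zero buckets above.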
After this patch:

1) iostat results:

Device       r/s    rMB/s   rrqm/s  %rrqm  r_await  rareq-sz  aqu-sz  %util
md0     87965.00  5271.88     0.00   0.00     0.16     61.37   14.03  90.60
sd*      6020.00   658.44  5117.00  45.95     0.44    112.00    2.68  86.50

2) blktrace G stage:

  8,0    0   206296  5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305  5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316  5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319  5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333  5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344  5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353  5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357  5.356020772   664  G   R 7158784 + 128 [dd]

3) io seek for sdx:

Read|Write seek cnt 28644, zero cnt 21483
>=(KB) .. <(KB) : count  ratio  |distribution                            |
     0 ..     1 : 21483  75.0%  |########################################|
     1 ..     2 :     0   0.0%  |                                        |
     2 ..     4 :     0   0.0%  |                                        |
     4 ..     8 :     0   0.0%  |                                        |
     8 ..    16 :     0   0.0%  |                                        |
    16 ..    32 :     0   0.0%  |                                        |
    32 ..    64 :  7161  25.0%  |##############                          |

BTW, this looks like a long-standing problem from day one, and large
sequential reads are a pretty common case, like video playback. Even
with this patch, IO in this test is merged to at most 128k because of
the block layer plug limit BLK_PLUG_FLUSH_SIZE; raising that limit gives
even better performance. However, we'll figure out how to do that
properly later.

Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Signed-off-by: Yu Kuai
Tested-by: Coly Li
---
 block/blk-core.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4201504158a1..0d46d10edb22 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
 	 * to collect a list of requests submited by a ->submit_bio method while
 	 * it is active, and then process them after it returned.
 	 */
-	if (current->bio_list)
-		bio_list_add(&current->bio_list[0], bio);
-	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+	if (current->bio_list) {
+		if (bio_flagged(bio, BIO_CHAIN))
+			bio_list_add_head(&current->bio_list[0], bio);
+		else
+			bio_list_add(&current->bio_list[0], bio);
+	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		__submit_bio_noacct_mq(bio);
-	else
+	} else {
 		__submit_bio_noacct(bio);
+	}
 }
 
 static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
-- 
2.39.2