From: Li Nan <linan122@huawei.com>
When a flush is issued to a RAID array, a child flush IO is created and
issued for each member disk in the RAID array. Since commit b75197e86e6d
("md: Remove flush handling"), each child flush IO has been chained with
the original bio. As a result, the failure of any child IO could modify
the bi_status of the original bio, potentially impacting the upper-layer
filesystem.
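
For context, bio_chain_endio() propagates a child's failure into its
parent roughly as follows (a simplified sketch of the block layer
helper; see block/bio.c for the exact code):

static void bio_chain_endio(struct bio *bio)
{
	struct bio *parent = bio->bi_private;

	/* a failed child overwrites the parent's status */
	if (bio->bi_status && !parent->bi_status)
		parent->bi_status = bio->bi_status;

	bio_put(bio);
	bio_endio(parent);
}

This status propagation is what lets a single failed member-disk flush
fail the array-level flush seen by the filesystem.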
Fix the issue by preventing a child flush IO from altering the original
bio->bi_status, as was the case before the commit above. However, this
design reintroduces a known limitation: in the event of a power failure,
if a flush IO on a member disk fails, the upper layers may not be
informed. That problem is not easy to fix and is not addressed here.
Fixes: b75197e86e6d ("md: Remove flush handling")
Signed-off-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 179ee4afe937..67108c397c5a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -546,6 +546,26 @@ static int mddev_set_closing_and_sync_blockdev(struct mddev *mddev, int opener_n
 	return 0;
 }
 
+/*
+ * The only difference from bio_chain_endio() is that the current
+ * bi_status of bio does not affect the bi_status of parent.
+ */
+static void md_end_flush(struct bio *bio)
+{
+	struct bio *parent = bio->bi_private;
+
+	/*
+	 * If any flush IO fails before a power failure,
+	 * disk data may be lost.
+	 */
+	if (bio->bi_status)
+		pr_err("md: %pg flush io error %d\n", bio->bi_bdev,
+		       blk_status_to_errno(bio->bi_status));
+
+	bio_put(bio);
+	bio_endio(parent);
+}
+
 bool md_flush_request(struct mddev *mddev, struct bio *bio)
 {
 	struct md_rdev *rdev;
@@ -565,7 +585,9 @@ bool md_flush_request(struct mddev *mddev, struct bio *bio)
 		new = bio_alloc_bioset(rdev->bdev, 0,
 				       REQ_OP_WRITE | REQ_PREFLUSH, GFP_NOIO,
 				       &mddev->bio_set);
-		bio_chain(new, bio);
+		new->bi_private = bio;
+		new->bi_end_io = md_end_flush;
+		bio_inc_remaining(bio);
 		submit_bio(new);
 	}
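
For context, bio_inc_remaining() takes an extra completion reference on
the parent bio, so the parent completes only after every child flush has
called bio_endio(parent) from md_end_flush(). A simplified sketch of the
helper (see include/linux/bio.h for the exact definition):

static inline void bio_inc_remaining(struct bio *bio)
{
	bio_set_flag(bio, BIO_CHAIN);
	smp_mb__before_atomic();
	atomic_inc(&bio->__bi_remaining);
}

Each submitted child thus pairs one bio_inc_remaining(bio) with one
bio_endio(parent), exactly what bio_chain() sets up internally, minus
the status propagation.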
--
2.39.2
On Wed, Sep 18, 2024 at 11:33 PM <linan666@huaweicloud.com> wrote:
[...]

Applied to md-6.12. Thanks for the fix!

Song
Hi,

On 2024/09/19 14:30, linan666@huaweicloud.com wrote:
[...]

LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>

> +	/*
> +	 * If any flush IO fails before a power failure,
> +	 * disk data may be lost.
> +	 */

The only solution I can think of is treating flush IO the same as meta
IO: just call md_error() on this rdev.

Thanks,
Kuai
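
A hypothetical sketch of Kuai's suggestion, for illustration only: the
md_flush_ctx structure, its allocation, and the end_io function name are
invented here; only md_error(), which md already calls for failed
metadata writes, is an existing interface.

/* hypothetical per-child context; not part of the applied patch */
struct md_flush_ctx {
	struct bio *parent;
	struct md_rdev *rdev;
};

static void md_end_flush_error(struct bio *bio)
{
	struct md_flush_ctx *ctx = bio->bi_private;

	/* treat a failed flush like a failed metadata write */
	if (bio->bi_status)
		md_error(ctx->rdev->mddev, ctx->rdev);

	bio_put(bio);
	bio_endio(ctx->parent);
	kfree(ctx);
}

bi_private would then carry this context instead of the parent bio,
since md_error() needs to know which rdev failed.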