md/raid5: fix race between reshape and chunk-aligned read

[PATCH] md/raid5: fix race between reshape and chunk-aligned read

Posted by FengWei Shih 2 months, 1 week ago

raid5_make_request() checks mddev->reshape_position to decide whether
to allow chunk-aligned reads. However in raid5_start_reshape(), the
layout configuration (raid_disks, algorithm, etc.) is updated before
mddev->reshape_position is set:

  reshape (raid5_start_reshape)        read (raid5_make_request)
  ==============================       ===========================
  write_seqcount_begin
  update raid_disks, algorithm...
  set conf->reshape_progress
  write_seqcount_end
                                        check mddev->reshape_position
                                          * still MaxSector, allow
                                        raid5_read_one_chunk()
                                          * use new layout
  raid5_quiesce()
  set mddev->reshape_position

Since reshape_position is not yet updated, raid5_make_request()
considers no reshape is in progress and proceeds with the
chunk-aligned path, but the layout has already changed, causing
raid5_compute_sector() to return an incorrect physical address.

Fix this by reading conf->reshape_progress under gen_lock in
raid5_read_one_chunk() and falling back to the stripe path if a
reshape is in progress.

Signed-off-by: FengWei Shih <dannyshih@synology.com>
---
 drivers/md/raid5.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a8e8d431071b..bded2b86f0ef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5421,6 +5421,11 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 	sector_t sector, end_sector;
 	int dd_idx;
 	bool did_inc;
+	int seq;
+
+	seq = read_seqcount_begin(&conf->gen_lock);
+	if (unlikely(conf->reshape_progress != MaxSector))
+		return 0;
 
 	if (!in_chunk_boundary(mddev, raid_bio)) {
 		pr_debug("%s: non aligned\n", __func__);
@@ -5431,6 +5436,9 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 				      &dd_idx, NULL);
 	end_sector = sector + bio_sectors(raid_bio);
 
+	if (read_seqcount_retry(&conf->gen_lock, seq))
+		return 0;
+
 	if (r5c_big_stripe_cached(conf, sector))
 		return 0;
 
-- 
2.25.1


Disclaimer: The contents of this e-mail message and any attachments are confidential and are intended solely for addressee. The information may also be legally privileged. This transmission is sent in trust, for the sole purpose of delivery to the intended recipient. If you have received this transmission in error, any use, reproduction or dissemination of this transmission is strictly prohibited. If you are not the intended recipient, please immediately notify the sender by reply e-mail or phone and delete this message and its attachments, if any.

Re: [PATCH] md/raid5: fix race between reshape and chunk-aligned read

Posted by Yu Kuai 2 months ago

Hi,

在 2026/4/9 13:17, FengWei Shih 写道:
> raid5_make_request() checks mddev->reshape_position to decide whether
> to allow chunk-aligned reads. However in raid5_start_reshape(), the
> layout configuration (raid_disks, algorithm, etc.) is updated before
> mddev->reshape_position is set:
>
>    reshape (raid5_start_reshape)        read (raid5_make_request)
>    ==============================       ===========================
>    write_seqcount_begin
>    update raid_disks, algorithm...
>    set conf->reshape_progress
>    write_seqcount_end
>                                          check mddev->reshape_position
>                                            * still MaxSector, allow
>                                          raid5_read_one_chunk()
>                                            * use new layout
>    raid5_quiesce()
>    set mddev->reshape_position

While we're here, I think it's pretty ugly to disable raid5_read_one_chunk
when reshape is not fully done. So a better solution should be:

- data behind reshape: read with new layout
- data ahead of reshape: read with old layout, reshape will also need to wait
for this IO to be done, before reshape can make progress.
- data intersecting the active reshape window: wait for reshape to make progress.

>
> Since reshape_position is not yet updated, raid5_make_request()
> considers no reshape is in progress and proceeds with the
> chunk-aligned path, but the layout has already changed, causing
> raid5_compute_sector() to return an incorrect physical address.
>
> Fix this by reading conf->reshape_progress under gen_lock in
> raid5_read_one_chunk() and falling back to the stripe path if a
> reshape is in progress.
>
> Signed-off-by: FengWei Shih <dannyshih@synology.com>
> ---
>   drivers/md/raid5.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index a8e8d431071b..bded2b86f0ef 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5421,6 +5421,11 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>   	sector_t sector, end_sector;
>   	int dd_idx;
>   	bool did_inc;
> +	int seq;
> +
> +	seq = read_seqcount_begin(&conf->gen_lock);
> +	if (unlikely(conf->reshape_progress != MaxSector))
> +		return 0;
>   
>   	if (!in_chunk_boundary(mddev, raid_bio)) {
>   		pr_debug("%s: non aligned\n", __func__);
> @@ -5431,6 +5436,9 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>   				      &dd_idx, NULL);
>   	end_sector = sector + bio_sectors(raid_bio);
>   
> +	if (read_seqcount_retry(&conf->gen_lock, seq))
> +		return 0;
> +
>   	if (r5c_big_stripe_cached(conf, sector))
>   		return 0;
>   

-- 
Thansk,
Kuai

Re: [PATCH] md/raid5: fix race between reshape and chunk-aligned read

Posted by FengWei Shih 2 months ago

Hi Kuai,

Yu Kuai 於 2026/4/19 下午 01:14 寫道:
> Hi,
>
> 在 2026/4/9 13:17, FengWei Shih 写道:
>> raid5_make_request() checks mddev->reshape_position to decide whether
>> to allow chunk-aligned reads. However in raid5_start_reshape(), the
>> layout configuration (raid_disks, algorithm, etc.) is updated before
>> mddev->reshape_position is set:
>>
>>     reshape (raid5_start_reshape)        read (raid5_make_request)
>>     ==============================       ===========================
>>     write_seqcount_begin
>>     update raid_disks, algorithm...
>>     set conf->reshape_progress
>>     write_seqcount_end
>>                                           check mddev->reshape_position
>>                                             * still MaxSector, allow
>>                                           raid5_read_one_chunk()
>>                                             * use new layout
>>     raid5_quiesce()
>>     set mddev->reshape_position
> While we're here, I think it's pretty ugly to disable raid5_read_one_chunk
> when reshape is not fully done. So a better solution should be:
>
> - data behind reshape: read with new layout
> - data ahead of reshape: read with old layout, reshape will also need to wait
> for this IO to be done, before reshape can make progress.
> - data intersecting the active reshape window: wait for reshape to make progress.

Thanks for the feedback. I agree that using reshape_progress to distinguish
the three cases (ahead/behind/inside reshape) would be more refined.

But disabling chunk-aligned reads during reshape was already the 
existing design.
In the original code, the check is at the caller level:

     if (rw == READ && mddev->degraded == 0 &&
         mddev->reshape_position == MaxSector) {
         bi = chunk_aligned_read(mddev, bi);
     }

My patch is focused on fixing the race condition in the existing 
lockless check of whether
reshape is in progress. So just to confirm: are you suggesting we add a 
mechanism to
allow chunk-aligned reads during reshape (based on reshape progress), rather
than simply disabling them?

>> Since reshape_position is not yet updated, raid5_make_request()
>> considers no reshape is in progress and proceeds with the
>> chunk-aligned path, but the layout has already changed, causing
>> raid5_compute_sector() to return an incorrect physical address.
>>
>> Fix this by reading conf->reshape_progress under gen_lock in
>> raid5_read_one_chunk() and falling back to the stripe path if a
>> reshape is in progress.
>>
>> Signed-off-by: FengWei Shih <dannyshih@synology.com>
>> ---
>>    drivers/md/raid5.c | 8 ++++++++
>>    1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index a8e8d431071b..bded2b86f0ef 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -5421,6 +5421,11 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>>    	sector_t sector, end_sector;
>>    	int dd_idx;
>>    	bool did_inc;
>> +	int seq;
>> +
>> +	seq = read_seqcount_begin(&conf->gen_lock);
>> +	if (unlikely(conf->reshape_progress != MaxSector))
>> +		return 0;
>>    
>>    	if (!in_chunk_boundary(mddev, raid_bio)) {
>>    		pr_debug("%s: non aligned\n", __func__);
>> @@ -5431,6 +5436,9 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>>    				      &dd_idx, NULL);
>>    	end_sector = sector + bio_sectors(raid_bio);
>>    
>> +	if (read_seqcount_retry(&conf->gen_lock, seq))
>> +		return 0;
>> +
>>    	if (r5c_big_stripe_cached(conf, sector))
>>    		return 0;
>>    
Thanks,
FengWei Shih

Disclaimer: The contents of this e-mail message and any attachments are confidential and are intended solely for addressee. The information may also be legally privileged. This transmission is sent in trust, for the sole purpose of delivery to the intended recipient. If you have received this transmission in error, any use, reproduction or dissemination of this transmission is strictly prohibited. If you are not the intended recipient, please immediately notify the sender by reply e-mail or phone and delete this message and its attachments, if any.

Re: [PATCH] md/raid5: fix race between reshape and chunk-aligned read

Posted by Yu Kuai 1 month, 3 weeks ago

Hi,

在 2026/4/21 15:09, FengWei Shih 写道:
> Hi Kuai,
>
> Yu Kuai 於 2026/4/19 下午 01:14 寫道:
>> Hi,
>>
>> 在 2026/4/9 13:17, FengWei Shih 写道:
>>> raid5_make_request() checks mddev->reshape_position to decide whether
>>> to allow chunk-aligned reads. However in raid5_start_reshape(), the
>>> layout configuration (raid_disks, algorithm, etc.) is updated before
>>> mddev->reshape_position is set:
>>>
>>>     reshape (raid5_start_reshape)        read (raid5_make_request)
>>>     ============================== ===========================
>>>     write_seqcount_begin
>>>     update raid_disks, algorithm...
>>>     set conf->reshape_progress
>>>     write_seqcount_end
>>>                                           check mddev->reshape_position
>>>                                             * still MaxSector, allow
>>> raid5_read_one_chunk()
>>>                                             * use new layout
>>>     raid5_quiesce()
>>>     set mddev->reshape_position
>> While we're here, I think it's pretty ugly to disable 
>> raid5_read_one_chunk
>> when reshape is not fully done. So a better solution should be:
>>
>> - data behind reshape: read with new layout
>> - data ahead of reshape: read with old layout, reshape will also need 
>> to wait
>> for this IO to be done, before reshape can make progress.
>> - data intersecting the active reshape window: wait for reshape to 
>> make progress.
>
> Thanks for the feedback. I agree that using reshape_progress to 
> distinguish
> the three cases (ahead/behind/inside reshape) would be more refined.
>
> But disabling chunk-aligned reads during reshape was already the 
> existing design.
> In the original code, the check is at the caller level:
>
>     if (rw == READ && mddev->degraded == 0 &&
>         mddev->reshape_position == MaxSector) {
>         bi = chunk_aligned_read(mddev, bi);
>     }
>
> My patch is focused on fixing the race condition in the existing 
> lockless check of whether
> reshape is in progress. So just to confirm: are you suggesting we add 
> a mechanism to
> allow chunk-aligned reads during reshape (based on reshape progress), 
> rather
> than simply disabling them?

Yes, I think this is a huge performance improvement. And in this case it's totally
fine to do chunk aligned read with old layout since reshape do not start yet.

>
>>> Since reshape_position is not yet updated, raid5_make_request()
>>> considers no reshape is in progress and proceeds with the
>>> chunk-aligned path, but the layout has already changed, causing
>>> raid5_compute_sector() to return an incorrect physical address.
>>>
>>> Fix this by reading conf->reshape_progress under gen_lock in
>>> raid5_read_one_chunk() and falling back to the stripe path if a
>>> reshape is in progress.
>>>
>>> Signed-off-by: FengWei Shih <dannyshih@synology.com>
>>> ---
>>>    drivers/md/raid5.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index a8e8d431071b..bded2b86f0ef 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -5421,6 +5421,11 @@ static int raid5_read_one_chunk(struct mddev 
>>> *mddev, struct bio *raid_bio)
>>>        sector_t sector, end_sector;
>>>        int dd_idx;
>>>        bool did_inc;
>>> +    int seq;
>>> +
>>> +    seq = read_seqcount_begin(&conf->gen_lock);
>>> +    if (unlikely(conf->reshape_progress != MaxSector))
>>> +        return 0;
>>>           if (!in_chunk_boundary(mddev, raid_bio)) {
>>>            pr_debug("%s: non aligned\n", __func__);
>>> @@ -5431,6 +5436,9 @@ static int raid5_read_one_chunk(struct mddev 
>>> *mddev, struct bio *raid_bio)
>>>                          &dd_idx, NULL);
>>>        end_sector = sector + bio_sectors(raid_bio);
>>>    +    if (read_seqcount_retry(&conf->gen_lock, seq))
>>> +        return 0;
>>> +
>>>        if (r5c_big_stripe_cached(conf, sector))
>>>            return 0;
> Thanks,
> FengWei Shih
>
> Disclaimer: The contents of this e-mail message and any attachments 
> are confidential and are intended solely for addressee. The 
> information may also be legally privileged. This transmission is sent 
> in trust, for the sole purpose of delivery to the intended recipient. 
> If you have received this transmission in error, any use, reproduction 
> or dissemination of this transmission is strictly prohibited. If you 
> are not the intended recipient, please immediately notify the sender 
> by reply e-mail or phone and delete this message and its attachments, 
> if any.

-- 
Thansk,
Kuai

Re: [PATCH] md/raid5: fix race between reshape and chunk-aligned read

Posted by Li Nan 2 months, 1 week ago


在 2026/4/9 13:17, FengWei Shih 写道:
> raid5_make_request() checks mddev->reshape_position to decide whether
> to allow chunk-aligned reads. However in raid5_start_reshape(), the
> layout configuration (raid_disks, algorithm, etc.) is updated before
> mddev->reshape_position is set:
> 
>    reshape (raid5_start_reshape)        read (raid5_make_request)
>    ==============================       ===========================
>    write_seqcount_begin
>    update raid_disks, algorithm...
>    set conf->reshape_progress
>    write_seqcount_end
>                                          check mddev->reshape_position
>                                            * still MaxSector, allow
>                                          raid5_read_one_chunk()
>                                            * use new layout
>    raid5_quiesce()
>    set mddev->reshape_position
> 
> Since reshape_position is not yet updated, raid5_make_request()
> considers no reshape is in progress and proceeds with the
> chunk-aligned path, but the layout has already changed, causing
> raid5_compute_sector() to return an incorrect physical address.
> 
> Fix this by reading conf->reshape_progress under gen_lock in
> raid5_read_one_chunk() and falling back to the stripe path if a
> reshape is in progress.
> 
> Signed-off-by: FengWei Shih <dannyshih@synology.com>
> ---
>   drivers/md/raid5.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index a8e8d431071b..bded2b86f0ef 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5421,6 +5421,11 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>   	sector_t sector, end_sector;
>   	int dd_idx;
>   	bool did_inc;
> +	int seq;
> +
> +	seq = read_seqcount_begin(&conf->gen_lock);
> +	if (unlikely(conf->reshape_progress != MaxSector))
> +		return 0;
>   
>   	if (!in_chunk_boundary(mddev, raid_bio)) {
>   		pr_debug("%s: non aligned\n", __func__);
> @@ -5431,6 +5436,9 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>   				      &dd_idx, NULL);
>   	end_sector = sector + bio_sectors(raid_bio);
>   
> +	if (read_seqcount_retry(&conf->gen_lock, seq))
> +		return 0;
> +
>   	if (r5c_big_stripe_cached(conf, sector))
>   		return 0;
>   

It seems that there might be race issues wherever raid5_compute_sector is
used? This fix only addresses one of the problems.

-- 
Thanks,
Nan

Re: [PATCH] md/raid5: fix race between reshape and chunk-aligned read

Posted by FengWei Shih 2 months, 1 week ago

Hi Nan,

Li Nan 於 2026/4/13 下午 03:19 寫道:
>
>
> 在 2026/4/9 13:17, FengWei Shih 写道:
>> raid5_make_request() checks mddev->reshape_position to decide whether
>> to allow chunk-aligned reads. However in raid5_start_reshape(), the
>> layout configuration (raid_disks, algorithm, etc.) is updated before
>> mddev->reshape_position is set:
>>
>>    reshape (raid5_start_reshape)        read (raid5_make_request)
>>    ============================== ===========================
>>    write_seqcount_begin
>>    update raid_disks, algorithm...
>>    set conf->reshape_progress
>>    write_seqcount_end
>>                                          check mddev->reshape_position
>>                                            * still MaxSector, allow
>>                                          raid5_read_one_chunk()
>>                                            * use new layout
>>    raid5_quiesce()
>>    set mddev->reshape_position
>>
>> Since reshape_position is not yet updated, raid5_make_request()
>> considers no reshape is in progress and proceeds with the
>> chunk-aligned path, but the layout has already changed, causing
>> raid5_compute_sector() to return an incorrect physical address.
>>
>> Fix this by reading conf->reshape_progress under gen_lock in
>> raid5_read_one_chunk() and falling back to the stripe path if a
>> reshape is in progress.
>>
>> Signed-off-by: FengWei Shih <dannyshih@synology.com>
>> ---
>>   drivers/md/raid5.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index a8e8d431071b..bded2b86f0ef 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -5421,6 +5421,11 @@ static int raid5_read_one_chunk(struct mddev 
>> *mddev, struct bio *raid_bio)
>>       sector_t sector, end_sector;
>>       int dd_idx;
>>       bool did_inc;
>> +    int seq;
>> +
>> +    seq = read_seqcount_begin(&conf->gen_lock);
>> +    if (unlikely(conf->reshape_progress != MaxSector))
>> +        return 0;
>>         if (!in_chunk_boundary(mddev, raid_bio)) {
>>           pr_debug("%s: non aligned\n", __func__);
>> @@ -5431,6 +5436,9 @@ static int raid5_read_one_chunk(struct mddev 
>> *mddev, struct bio *raid_bio)
>>                         &dd_idx, NULL);
>>       end_sector = sector + bio_sectors(raid_bio);
>>   +    if (read_seqcount_retry(&conf->gen_lock, seq))
>> +        return 0;
>> +
>>       if (r5c_big_stripe_cached(conf, sector))
>>           return 0;
>
> It seems that there might be race issues wherever raid5_compute_sector is
> used? This fix only addresses one of the problems.
>
Thanks for the review. You are right that this race pattern affects
more than just raid5_read_one_chunk(). I checked the callers of
raid5_compute_*() and some lockless reshape_progress / reshape_position
checks:

Already safe:

- make_stripe_request(): already uses gen_lock seqcount properly.

- init_stripe() / stripe_set_idx(): init_stripe() is under gen_lock;
   stripe_set_idx() is only used for the dst stripe in handle_stripe().

- handle_stripe_expansion() / reshape_request(): reshape-internal,
   intentional new layout.

- raid5-cache.c / raid5-ppl.c: journal/PPL are not allowed with
   reshape, so no race.

- raid5_bio_lowest_chunk_sector(): no lock protection, but the return
   value is bounded within the bio's stripe range, so the worst case
   is suboptimal I/O ordering — no data corruption or lost I/O.

Needs fix:

- raid5_read_one_chunk(): fixed by this patch.

- retry_aligned_read(): has a similar issue. I will fix it with
   gen_lock seqcount in the next version of this patch.

- raid5_bitmap_sector(): if the check sees LOC_NO_RESHAPE but reshape
   starts and passes this region before the stripe is processed, the
   bitmap position will not match the layout used for the write.

- make_discard_request(): checks mddev->reshape_position and computes
   logical sectors from the layout fields, all without any lock.

- raid5_make_request(): the lockless reshape_progress check that decides
   on_wq can race with reshape start.

I think there are two possible solutions:

1. All callers should check reshape within proper locking
    (conf->device_lock or conf->gen_lock).

2. Suspend I/O during start_reshape via mddev_suspend() /
    mddev_resume(), so readers do not need to worry about seeing an
    inconsistent state.

I am going with direction 1 for now since gen_lock seems to be
designed for exactly this kind of race.

Disclaimer: The contents of this e-mail message and any attachments are confidential and are intended solely for addressee. The information may also be legally privileged. This transmission is sent in trust, for the sole purpose of delivery to the intended recipient. If you have received this transmission in error, any use, reproduction or dissemination of this transmission is strictly prohibited. If you are not the intended recipient, please immediately notify the sender by reply e-mail or phone and delete this message and its attachments, if any.