md/raid1: skip recovery of already synced areas

[PATCH] md/raid1: skip recovery of already synced areas

Posted by linan666@huaweicloud.com 5 months ago

From: Li Nan <linan122@huawei.com>

When a new disk is added during running recovery, the kernel may
restart recovery from the beginning of the device and submit write
io to ranges that have already been synchronized.

Reproduce:
  mdadm -CR /dev/md0 -l1 -n3 /dev/sda missing missing
  mdadm --add /dev/md0 /dev/sdb
  sleep 10
  cat /proc/mdstat	# partially synchronized
  mdadm --add /dev/md0 /dev/sdc
  cat /proc/mdstat	# start from 0
  iostat 1 sdb sdc	# sdb has io, too

If 'rdev->recovery_offset' is ahead of the current recovery sector,
read from that device instead of issuing a write. It prevents
unnecessary writes while still preserving the chance to back up data
if it is the last copy.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid1.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3e422854cafb..ac5a9b73157a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2894,7 +2894,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 		    test_bit(Faulty, &rdev->flags)) {
 			if (i < conf->raid_disks)
 				still_degraded = true;
-		} else if (!test_bit(In_sync, &rdev->flags)) {
+		} else if (!test_bit(In_sync, &rdev->flags) &&
+			   rdev->recovery_offset <= sector_nr) {
 			bio->bi_opf = REQ_OP_WRITE;
 			bio->bi_end_io = end_sync_write;
 			write_targets ++;
@@ -2903,6 +2904,9 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 			sector_t first_bad = MaxSector;
 			sector_t bad_sectors;
 
+			if (!test_bit(In_sync, &rdev->flags))
+				good_sectors = min(rdev->recovery_offset - sector_nr,
+						   (u64)good_sectors);
 			if (is_badblock(rdev, sector_nr, good_sectors,
 					&first_bad, &bad_sectors)) {
 				if (first_bad > sector_nr)
-- 
2.39.2

Re: [PATCH] md/raid1: skip recovery of already synced areas

Posted by John Stoffel 4 months, 3 weeks ago

>>>>> "linan666" == linan666  <linan666@huaweicloud.com> writes:

> From: Li Nan <linan122@huawei.com>
> When a new disk is added during running recovery, the kernel may
> restart recovery from the beginning of the device and submit write
> io to ranges that have already been synchronized.

Isn't it beter to be safe than sorry?  If the resync fails for some
reason, how can we be sure the devices really are in sync if you don't
force the re-write?  

> Reproduce:
>   mdadm -CR /dev/md0 -l1 -n3 /dev/sda missing missing
>   mdadm --add /dev/md0 /dev/sdb
>   sleep 10
>   cat /proc/mdstat	# partially synchronized
>   mdadm --add /dev/md0 /dev/sdc
>   cat /proc/mdstat	# start from 0
>   iostat 1 sdb sdc	# sdb has io, too

> If 'rdev->recovery_offset' is ahead of the current recovery sector,
> read from that device instead of issuing a write. It prevents
> unnecessary writes while still preserving the chance to back up data
> if it is the last copy.


Are you saying that sdb here can continute writing from block N, but
that your change will only force sdc to start writing from block 0?
Your description of the problem isn't really clear.  

I think it's because you're using the word device for both the MD
device itself, as well as the underlying device(s) or component(s) of
the MD device.  



> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>  drivers/md/raid1.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)

> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 3e422854cafb..ac5a9b73157a 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2894,7 +2894,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  		    test_bit(Faulty, &rdev->flags)) {
>  			if (i < conf->raid_disks)
>  				still_degraded = true;
> -		} else if (!test_bit(In_sync, &rdev->flags)) {
> +		} else if (!test_bit(In_sync, &rdev->flags) &&
> +			   rdev->recovery_offset <= sector_nr) {
bio-> bi_opf = REQ_OP_WRITE;
bio-> bi_end_io = end_sync_write;
>  			write_targets ++;
> @@ -2903,6 +2904,9 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  			sector_t first_bad = MaxSector;
>  			sector_t bad_sectors;
 
> +			if (!test_bit(In_sync, &rdev->flags))
> +				good_sectors = min(rdev->recovery_offset - sector_nr,
> +						   (u64)good_sectors);
>  			if (is_badblock(rdev, sector_nr, good_sectors,
>  					&first_bad, &bad_sectors)) {
>  				if (first_bad > sector_nr)
> -- 
> 2.39.2

Re: [PATCH] md/raid1: skip recovery of already synced areas

Posted by Li Nan 4 months, 3 weeks ago


在 2025/9/18 4:05, John Stoffel 写道:
>>>>>> "linan666" == linan666  <linan666@huaweicloud.com> writes:
> 
>> From: Li Nan <linan122@huawei.com>
>> When a new disk is added during running recovery, the kernel may
>> restart recovery from the beginning of the device and submit write
>> io to ranges that have already been synchronized.
> 
> Isn't it beter to be safe than sorry?  If the resync fails for some
> reason, how can we be sure the devices really are in sync if you don't
> force the re-write?
> 
>> Reproduce:
>>    mdadm -CR /dev/md0 -l1 -n3 /dev/sda missing missing
>>    mdadm --add /dev/md0 /dev/sdb
>>    sleep 10
>>    cat /proc/mdstat	# partially synchronized
>>    mdadm --add /dev/md0 /dev/sdc
>>    cat /proc/mdstat	# start from 0
>>    iostat 1 sdb sdc	# sdb has io, too
> 
>> If 'rdev->recovery_offset' is ahead of the current recovery sector,
>> read from that device instead of issuing a write. It prevents
>> unnecessary writes while still preserving the chance to back up data
>> if it is the last copy.
> 
> 
> Are you saying that sdb here can continute writing from block N, but
> that your change will only force sdc to start writing from block 0?
> Your description of the problem isn't really clear.
> 
Thanks for your review. You're right, I will describe it more accurately
in the next patch.

> I think it's because you're using the word device for both the MD
> device itself, as well as the underlying device(s) or component(s) of
> the MD device.
> 
> 
> 
>> Signed-off-by: Li Nan <linan122@huawei.com>
>> ---
>>   drivers/md/raid1.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 3e422854cafb..ac5a9b73157a 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -2894,7 +2894,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>>   		    test_bit(Faulty, &rdev->flags)) {
>>   			if (i < conf->raid_disks)
>>   				still_degraded = true;
>> -		} else if (!test_bit(In_sync, &rdev->flags)) {
>> +		} else if (!test_bit(In_sync, &rdev->flags) &&
>> +			   rdev->recovery_offset <= sector_nr) {
> bio-> bi_opf = REQ_OP_WRITE;
> bio-> bi_end_io = end_sync_write;

As Yu Kuai mentioned, I will refactor this part of the code later.
This patch will be dropped.

>>   			write_targets ++;
>> @@ -2903,6 +2904,9 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>>   			sector_t first_bad = MaxSector;
>>   			sector_t bad_sectors;
>   
>> +			if (!test_bit(In_sync, &rdev->flags))
>> +				good_sectors = min(rdev->recovery_offset - sector_nr,
>> +						   (u64)good_sectors);
>>   			if (is_badblock(rdev, sector_nr, good_sectors,
>>   					&first_bad, &bad_sectors)) {
>>   				if (first_bad > sector_nr)
>> -- 
>> 2.39.2
> 
> 
> 
> .

-- 
Thanks,
Nan

Re: [PATCH] md/raid1: skip recovery of already synced areas

Posted by Yu Kuai 5 months ago

Hi,

在 2025/09/10 16:25, linan666@huaweicloud.com 写道:
> From: Li Nan <linan122@huawei.com>
> 
> When a new disk is added during running recovery, the kernel may
> restart recovery from the beginning of the device and submit write
> io to ranges that have already been synchronized.
> 
> Reproduce:
>    mdadm -CR /dev/md0 -l1 -n3 /dev/sda missing missing
>    mdadm --add /dev/md0 /dev/sdb
>    sleep 10
>    cat /proc/mdstat	# partially synchronized
>    mdadm --add /dev/md0 /dev/sdc
>    cat /proc/mdstat	# start from 0
>    iostat 1 sdb sdc	# sdb has io, too
> 
> If 'rdev->recovery_offset' is ahead of the current recovery sector,
> read from that device instead of issuing a write. It prevents
> unnecessary writes while still preserving the chance to back up data
> if it is the last copy.
> 
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>   drivers/md/raid1.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 3e422854cafb..ac5a9b73157a 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2894,7 +2894,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>   		    test_bit(Faulty, &rdev->flags)) {
>   			if (i < conf->raid_disks)
>   				still_degraded = true;
> -		} else if (!test_bit(In_sync, &rdev->flags)) {
> +		} else if (!test_bit(In_sync, &rdev->flags) &&
> +			   rdev->recovery_offset <= sector_nr) {
>   			bio->bi_opf = REQ_OP_WRITE;
>   			bio->bi_end_io = end_sync_write;
>   			write_targets ++;
> @@ -2903,6 +2904,9 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>   			sector_t first_bad = MaxSector;
>   			sector_t bad_sectors;
>   
> +			if (!test_bit(In_sync, &rdev->flags))
> +				good_sectors = min(rdev->recovery_offset - sector_nr,
> +						   (u64)good_sectors);
>   			if (is_badblock(rdev, sector_nr, good_sectors,
>   					&first_bad, &bad_sectors)) {
>   				if (first_bad > sector_nr)
> 

This patch looks correct, however, I took a long time to go through all
the details, and there is still the same problem in the case new disk is
added during resync.

Perhaps this is a good time to cleanup raid1_sync_request() now, just
don't mess resync and recover together in the same code path with lots
of special handling.

Thanks,
Kuai