[v2] cleanup and bugfix of sync

[PATCH v2 06/11] md: remove MD_RECOVERY_ERROR handling and simplify resync_offset update

Posted by linan666@huaweicloud.com 3 months ago

From: Li Nan <linan122@huawei.com>

When sync IO fails and setting a badblock also fails, the unsynced disk
might be kicked by setting 'recovery_disabled' without the Faulty flag.
MD_RECOVERY_ERROR was set in md_sync_error() to prevent 'resync_offset'
from being updated, so that the failed sync sectors would not be read.

The previous patch ensures the disk is marked Faulty when setting the
badblock fails. Remove the MD_RECOVERY_ERROR handling as it is no longer
needed: failed sync sectors are unreadable either because of the badblock
or because the disk is Faulty.

Simplify the resync_offset update logic.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/md.h |  2 --
 drivers/md/md.c | 23 +++++------------------
 2 files changed, 5 insertions(+), 20 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 18621dba09a9..c5b5377e9049 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -644,8 +644,6 @@ enum recovery_flags {
 	MD_RECOVERY_FROZEN,
 	/* waiting for pers->start() to finish */
 	MD_RECOVERY_WAIT,
-	/* interrupted because io-error */
-	MD_RECOVERY_ERROR,
 
 	/* flags determines sync action, see details in enum sync_action */
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2bdbb5b0e9e1..71988d8f5154 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8949,7 +8949,6 @@ void md_sync_error(struct mddev *mddev)
 {
 	// stop recovery, signal do_sync ....
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-	set_bit(MD_RECOVERY_ERROR, &mddev->recovery);
 	md_wakeup_thread(mddev->thread);
 }
 EXPORT_SYMBOL(md_sync_error);
@@ -9603,8 +9602,8 @@ void md_do_sync(struct md_thread *thread)
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
-	    !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    mddev->curr_resync >= MD_RESYNC_ACTIVE) {
+		/* All sync IO completes after recovery_active becomes 0 */
 		mddev->curr_resync_completed = mddev->curr_resync;
 		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 	}
@@ -9612,24 +9611,12 @@ void md_do_sync(struct md_thread *thread)
 
 	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
 	    mddev->curr_resync > MD_RESYNC_ACTIVE) {
+		if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+			mddev->curr_resync = MaxSector;
+
 		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
-				if (mddev->curr_resync >= mddev->resync_offset) {
-					pr_debug("md: checkpointing %s of %s.\n",
-						 desc, mdname(mddev));
-					if (test_bit(MD_RECOVERY_ERROR,
-						&mddev->recovery))
-						mddev->resync_offset =
-							mddev->curr_resync_completed;
-					else
-						mddev->resync_offset =
-							mddev->curr_resync;
-				}
-			} else
-				mddev->resync_offset = MaxSector;
+			mddev->resync_offset = mddev->curr_resync;
 		} else {
-			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
-				mddev->curr_resync = MaxSector;
 			if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 			    test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) {
 				rcu_read_lock();
-- 
2.39.2
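
For readers comparing the two versions of the checkpoint step, here is a minimal userspace sketch of the removed and added lines in md_do_sync(). It is a model, not kernel code: struct sync_state, checkpoint_old(), checkpoint_new() and checkpoint_example() are hypothetical names, the booleans stand in for the recovery bits, and the outer guard (MD_RECOVERY_CHECK clear and curr_resync > MD_RESYNC_ACTIVE) plus the rest of the recovery branch are assumed and left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;
#define MaxSector (~(sector_t)0)

/* Hypothetical stand-in for the struct mddev fields and recovery bits used here. */
struct sync_state {
	sector_t curr_resync;
	sector_t curr_resync_completed;
	sector_t resync_offset;
	bool sync;	/* models MD_RECOVERY_SYNC */
	bool intr;	/* models MD_RECOVERY_INTR */
	bool error;	/* models the removed MD_RECOVERY_ERROR */
};

/* Pre-patch checkpoint logic, transcribed from the removed lines. */
static inline void checkpoint_old(struct sync_state *s)
{
	if (s->sync) {
		if (s->intr) {
			if (s->curr_resync >= s->resync_offset)
				s->resync_offset = s->error ?
					s->curr_resync_completed : s->curr_resync;
		} else {
			s->resync_offset = MaxSector;
		}
	} else if (!s->intr) {
		s->curr_resync = MaxSector;
	}
}

/* Post-patch checkpoint logic, transcribed from the added lines. */
static inline void checkpoint_new(struct sync_state *s)
{
	/* An uninterrupted run covered the whole device. */
	if (!s->intr)
		s->curr_resync = MaxSector;
	/* For a resync, the checkpoint simply follows curr_resync. */
	if (s->sync)
		s->resync_offset = s->curr_resync;
}

/* Example: a resync interrupted by an IO error at sector 2048. */
void checkpoint_example(void)
{
	struct sync_state before = {
		.curr_resync = 2048, .curr_resync_completed = 1024,
		.resync_offset = 0, .sync = true, .intr = true, .error = true,
	};
	struct sync_state after = before;

	checkpoint_old(&before);
	checkpoint_new(&after);
	assert(before.resync_offset == 1024);	/* held back to curr_resync_completed */
	assert(after.resync_offset == 2048);	/* follows curr_resync */
}
```

In this model the two versions differ in the error case shown in checkpoint_example(), and the new version also drops the curr_resync >= resync_offset guard. For an uninterrupted resync both end with resync_offset == MaxSector.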

Re: [PATCH v2 06/11] md: remove MD_RECOVERY_ERROR handling and simplify resync_offset update

Posted by Yu Kuai 3 months ago

Hi,

On 2025/11/6 19:59, linan666@huaweicloud.com wrote:
> @@ -9603,8 +9602,8 @@ void md_do_sync(struct md_thread *thread)
>   	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>   
>   	if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> -	    !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&

Why is the above check removed?

Thanks,
Kuai

>   	    mddev->curr_resync >= MD_RESYNC_ACTIVE) {
> +		/* All sync IO completes after recovery_active becomes 0 */
>   		mddev->curr_resync_completed = mddev->curr_resync;
>   		sysfs_notify_dirent_safe(mddev->sysfs_completed);
>   	}

Re: [PATCH v2 06/11] md: remove MD_RECOVERY_ERROR handling and simplify resync_offset update

Posted by Li Nan 2 months, 4 weeks ago


On 2025/11/8 18:22, Yu Kuai wrote:
> Hi,
> 
> On 2025/11/6 19:59, linan666@huaweicloud.com wrote:
>> @@ -9603,8 +9602,8 @@ void md_do_sync(struct md_thread *thread)
>>    	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>>    
>>    	if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> -	    !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> 
> Why is the above check removed?
> 
> Thanks,
> Kuai
> 

Before patch 05, a failed sync IO might end and decrement recovery_active
before its error handling had completed. That error handling sets
recovery_disabled and MD_RECOVERY_INTR, and removes the failed disk later.
If 'curr_resync_completed' is updated before the disk is removed, reads may
be served from the regions where sync failed.

After patch 05, the failed IO is always handled. Once the wait for
'recovery_active' to become 0 on the previous line returns, all sync IO has
completed regardless of whether MD_RECOVERY_INTR is set. Thus, this check
can be removed.
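
To make the ordering argument concrete, here is a minimal single-threaded sketch. It is a model under assumptions, not the kernel code: struct sync_model and the model_* helpers are hypothetical names, MODEL_RESYNC_ACTIVE is an assumed placeholder threshold, and the early return stands in for sleeping in wait_event(). It also assumes, as my reading of patch 05, that error handling for a failed IO finishes before recovery_active is decremented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Assumed threshold for this model; the kernel defines MD_RESYNC_ACTIVE in md.h. */
#define MODEL_RESYNC_ACTIVE 3

/* Hypothetical stand-in for the struct mddev fields involved. */
struct sync_model {
	int recovery_active;		/* in-flight sync IO */
	sector_t curr_resync;
	sector_t curr_resync_completed;
	bool reshape;			/* models MD_RECOVERY_RESHAPE */
	bool intr;			/* models MD_RECOVERY_INTR */
};

/* Completion of one sync IO in the model. */
static inline void model_end_sync_io(struct sync_model *m, bool failed)
{
	if (failed)
		m->intr = true;		/* error handling runs first ... */
	m->recovery_active--;		/* ... then the IO is accounted as done */
}

/*
 * Publish progress once no sync IO is in flight. Returns false while IO is
 * still outstanding, where the real code would sleep in wait_event().
 */
static inline bool model_publish_completed(struct sync_model *m)
{
	if (m->recovery_active != 0)
		return false;
	/* No test of intr here: every IO, failed or not, has completed. */
	if (!m->reshape && m->curr_resync >= MODEL_RESYNC_ACTIVE)
		m->curr_resync_completed = m->curr_resync;
	return true;
}

/* Example: two IOs in flight, the second one fails. */
void publish_example(void)
{
	struct sync_model m = { .recovery_active = 2, .curr_resync = 1000 };

	model_end_sync_io(&m, false);
	assert(!model_publish_completed(&m));	/* one IO still in flight */
	assert(m.curr_resync_completed == 0);

	model_end_sync_io(&m, true);
	assert(model_publish_completed(&m));
	assert(m.intr && m.curr_resync_completed == 1000);
}
```

The point of the sketch is only that the update is gated on the counter reaching zero rather than on the interrupt flag.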

So I added the following comment:

>>    	    mddev->curr_resync >= MD_RESYNC_ACTIVE) {
>> +		/* All sync IO completes after recovery_active becomes 0 */
>>    		mddev->curr_resync_completed = mddev->curr_resync;

Since the logic behind this change is complex, should I separate it into a
new commit?

-- 
Thanks,
Nan

Re: [PATCH v2 06/11] md: remove MD_RECOVERY_ERROR handling and simplify resync_offset update

Posted by Yu Kuai 2 months, 1 week ago

Hi,

On 2025/11/10 20:17, Li Nan wrote:
>
> Since the logic behind this change is complex, should I separate it into a
> new commit?

Please separate.

-- 
Thanks,
Kuai