[PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Yu Kuai posted 6 patches 2 years, 8 months ago
[PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
From: Yu Kuai <yukuai3@huawei.com>

Our test found the following deadlock in raid10:

1) Issue a normal write, and such write failed:

  raid10_end_write_request
   set_bit(R10BIO_WriteError, &r10_bio->state)
   one_write_done
    reschedule_retry

  // later from md thread
  raid10d
   handle_write_completed
    list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

  // later from md thread
  raid10d
   if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
    list_move(conf->bio_end_io_list.prev, &tmp)
    r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
    raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

2) Trigger a recovery:

  raid10_sync_request
   raise_barrier

Dependency chain 2: sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

  action_store
   mddev_lock
    md_unregister_thread
     kthread_stop

Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread

4) md thread can't update superblock:

  raid10d
   md_check_recovery
    if (mddev_trylock(mddev))
     md_update_sb

Dependency chain 4: update superblock is waiting for 'reconfig_mutex'

Hence a cyclic dependency exists. In order to fix the problem, we must
break one of the dependencies. Dependencies 1 and 2 can't be broken
because they are part of the foundational design. Breaking dependency 4
might be possible if it could be guaranteed that no io is inflight,
however, this requires a new mechanism which seems complex. Dependency 3
is a good choice, because idle/frozen only requires the sync thread to
finish, which can already be done asynchronously, and 'reconfig_mutex'
is not needed anymore.

This patch switches 'idle' and 'frozen' to wait for the sync thread to
finish asynchronously, and also adds a sequence counter to record how
many times the sync thread has finished, so that 'idle' won't keep
waiting on a newly started sync thread.

Note that raid456 has a similar deadlock ([1]), and it's verified [2]
that this deadlock can be fixed by this patch as well.

[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c | 23 +++++++++++++++++++----
 drivers/md/md.h |  2 ++
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 63a993b52cd7..7912de0e4d12 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
 	timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
 	atomic_set(&mddev->active, 1);
 	atomic_set(&mddev->openers, 0);
+	atomic_set(&mddev->sync_seq, 0);
 	spin_lock_init(&mddev->lock);
 	atomic_set(&mddev->flush_pending, 0);
 	init_waitqueue_head(&mddev->sb_wait);
@@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
 	if (work_pending(&mddev->del_work))
 		flush_workqueue(md_misc_wq);
 
-	if (mddev->sync_thread) {
-		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-		md_reap_sync_thread(mddev);
-	}
+	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+	/*
+	 * Thread might be blocked waiting for metadata update which will now
+	 * never happen
+	 */
+	md_wakeup_thread_directly(mddev->sync_thread);
 
 	mddev_unlock(mddev);
 }
 
 static void idle_sync_thread(struct mddev *mddev)
 {
+	int sync_seq = atomic_read(&mddev->sync_seq);
+
 	mutex_lock(&mddev->sync_mutex);
 	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 	stop_sync_thread(mddev);
+
+	wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+			!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
 	mutex_unlock(&mddev->sync_mutex);
 }
 
@@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev *mddev)
 	mutex_init(&mddev->delete_mutex);
 	set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 	stop_sync_thread(mddev);
+
+	wait_event(resync_wait, mddev->sync_thread == NULL &&
+			!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
 	mutex_unlock(&mddev->sync_mutex);
 }
 
@@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
 
 	/* resync has finished, collect result */
 	md_unregister_thread(&mddev->sync_thread);
+	atomic_inc(&mddev->sync_seq);
+
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2fa903de5bd0..7cab9c7c45b8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -539,6 +539,8 @@ struct mddev {
 
 	/* Used to synchronize idle and frozen for action_store() */
 	struct mutex			sync_mutex;
+	/* The sequence number for sync thread */
+	atomic_t sync_seq;
 
 	bool	has_superblocks:1;
 	bool	fail_last_dev:1;
-- 
2.39.2
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On 2023/5/29 9:20 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Our test found a following deadlock in raid10:
>
> 1) Issue a normal write, and such write failed:
>
>    raid10_end_write_request
>     set_bit(R10BIO_WriteError, &r10_bio->state)
>     one_write_done
>      reschedule_retry
>
>    // later from md thread
>    raid10d
>     handle_write_completed
>      list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
>
>    // later from md thread
>    raid10d
>     if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>      list_move(conf->bio_end_io_list.prev, &tmp)
>      r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>      raid_end_bio_io(r10_bio)
>
> Dependency chain 1: normal io is waiting for updating superblock

Hi Kuai

It looks like the above situation is more complex than necessary. It
only needs a normal write, and md_write_start() needs to wait until the
metadata is written to the member disks, right? If so, there is no need
to introduce a raid10 write failure here. I guess this reflects your
test case. It would be nice if you could put your test steps in the
patch, but for the analysis of the deadlock here, it's better to keep it
simple.

>
> 2) Trigger a recovery:
>
>    raid10_sync_request
>     raise_barrier
>
> Dependency chain 2: sync thread is waiting for normal io
>
> 3) echo idle/frozen to sync_action:
>
>    action_store
>     mddev_lock
>      md_unregister_thread
>       kthread_stop
>
> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
>
> 4) md thread can't update superblock:
>
>    raid10d
>     md_check_recovery
>      if (mddev_trylock(mddev))
>       md_update_sb
>
> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
>
> Hence cyclic dependency exist, in order to fix the problem, we must
> break one of them. Dependency 1 and 2 can't be broken because they are
> foundation design. Dependency 4 may be possible if it can be guaranteed
> that no io can be inflight, however, this requires a new mechanism which
> seems complex. Dependency 3 is a good choice, because idle/frozen only
> requires sync thread to finish, which can be done asynchronously that is
> already implemented, and 'reconfig_mutex' is not needed anymore.
>
> This patch switch 'idle' and 'frozen' to wait sync thread to be done
> asynchronously, and this patch also add a sequence counter to record how
> many times sync thread is done, so that 'idle' won't keep waiting on new
> started sync thread.

In the patch, sync_seq is incremented in md_reap_sync_thread(). In
idle_sync_thread(), if the local sync_seq isn't equal to
mddev->sync_seq, it should mean that someone has already stopped the
sync thread, right? Why do you say 'newly started sync thread' here?

Regards

Xiao


>
> Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
> deadlock can be fixed by this patch as well.
>
> [1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
> [2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md.c | 23 +++++++++++++++++++----
>   drivers/md/md.h |  2 ++
>   2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 63a993b52cd7..7912de0e4d12 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
>   	timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
>   	atomic_set(&mddev->active, 1);
>   	atomic_set(&mddev->openers, 0);
> +	atomic_set(&mddev->sync_seq, 0);
>   	spin_lock_init(&mddev->lock);
>   	atomic_set(&mddev->flush_pending, 0);
>   	init_waitqueue_head(&mddev->sb_wait);
> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
>   	if (work_pending(&mddev->del_work))
>   		flush_workqueue(md_misc_wq);
>   
> -	if (mddev->sync_thread) {
> -		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -		md_reap_sync_thread(mddev);
> -	}
> +	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> +	/*
> +	 * Thread might be blocked waiting for metadata update which will now
> +	 * never happen
> +	 */
> +	md_wakeup_thread_directly(mddev->sync_thread);
>   
>   	mddev_unlock(mddev);
>   }
>   
>   static void idle_sync_thread(struct mddev *mddev)
>   {
> +	int sync_seq = atomic_read(&mddev->sync_seq);
> +
>   	mutex_lock(&mddev->sync_mutex);
>   	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>   	stop_sync_thread(mddev);
> +
> +	wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
> +			!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> +
>   	mutex_unlock(&mddev->sync_mutex);
>   }
>   
> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev *mddev)
>   	mutex_init(&mddev->delete_mutex);
>   	set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>   	stop_sync_thread(mddev);
> +
> +	wait_event(resync_wait, mddev->sync_thread == NULL &&
> +			!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> +
>   	mutex_unlock(&mddev->sync_mutex);
>   }
>   
> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
>   
>   	/* resync has finished, collect result */
>   	md_unregister_thread(&mddev->sync_thread);
> +	atomic_inc(&mddev->sync_seq);
> +
>   	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>   	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>   	    mddev->degraded != mddev->raid_disks) {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 2fa903de5bd0..7cab9c7c45b8 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -539,6 +539,8 @@ struct mddev {
>   
>   	/* Used to synchronize idle and frozen for action_store() */
>   	struct mutex			sync_mutex;
> +	/* The sequence number for sync thread */
> +	atomic_t sync_seq;
>   
>   	bool	has_superblocks:1;
>   	bool	fail_last_dev:1;

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
Hi,

On 2023/06/13 22:50, Xiao Ni wrote:
> 
> On 2023/5/29 9:20 PM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Our test found a following deadlock in raid10:
>>
>> 1) Issue a normal write, and such write failed:
>>
>>    raid10_end_write_request
>>     set_bit(R10BIO_WriteError, &r10_bio->state)
>>     one_write_done
>>      reschedule_retry
>>
>>    // later from md thread
>>    raid10d
>>     handle_write_completed
>>      list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
>>
>>    // later from md thread
>>    raid10d
>>     if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>      list_move(conf->bio_end_io_list.prev, &tmp)
>>      r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>>      raid_end_bio_io(r10_bio)
>>
>> Dependency chain 1: normal io is waiting for updating superblock
> 
> Hi Kuai
> 
> It looks like the above situation is more complex. It only needs a 
> normal write and md_write_start needs to
> 
> wait until the metadata is written to member disks, right? If so, it 
> doesn't need to introduce raid10 write failure
> 
> here. I guess, it should be your test case. It's nice, if you can put 
> your test steps in the patch. But for the analysis
> 
> of the deadlock here, it's better to be simple.

The test script can be found here; the deadlock is pretty easy to trigger:

https://patchwork.kernel.org/project/linux-raid/patch/20230529132826.2125392-4-yukuai1@huaweicloud.com/

While reviewing the related code, I found that io can only be added to
the list bio_end_io_list from handle_write_completed() if the io failed,
so I think an io failure is needed to trigger the deadlock from the
daemon thread.
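
For reference, the daemon side of dependency chain 1 can be sketched
roughly like this (condensed from the raid10d() call chain in the commit
message, not the literal raid10.c code; the helper name is made up and
locking is omitted):

/*
 * Sketch only: failed writes parked on bio_end_io_list are completed by
 * the daemon only once the superblock is no longer marked pending.
 */
static void flush_bio_end_io_list_sketch(struct r10conf *conf,
					 struct mddev *mddev)
{
	LIST_HEAD(tmp);
	struct r10bio *r10_bio;

	/* Superblock update still pending: keep the failed ios parked. */
	if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
		return;

	list_splice_init(&conf->bio_end_io_list, &tmp);
	while (!list_empty(&tmp)) {
		r10_bio = list_first_entry(&tmp, struct r10bio, retry_list);
		list_del(&r10_bio->retry_list);
		raid_end_bio_io(r10_bio);	/* finally ends the normal io */
	}
}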

I think the key point is how MD_SB_CHANGE_PENDING is set:

1) raid10_error() and rdev_set_badblocks(), triggered by io failure;
2) raid10_write_request() related to reshape;
3) md_write_start() and md_allow_write(), when mddev->in_sync is set;
however, I don't think this is a common case;

1) is used here because it's quite easy to trigger and this is what we
met in real tests. 3) is possible, but I would say let's keep 1); I
don't think it's necessary to reproduce this deadlock through another
path again.

Thanks,
Kuai
> 
>>
>> 2) Trigger a recovery:
>>
>>    raid10_sync_request
>>     raise_barrier
>>
>> Dependency chain 2: sync thread is waiting for normal io
>>
>> 3) echo idle/frozen to sync_action:
>>
>>    action_store
>>     mddev_lock
>>      md_unregister_thread
>>       kthread_stop
>>
>> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
>>
>> 4) md thread can't update superblock:
>>
>>    raid10d
>>     md_check_recovery
>>      if (mddev_trylock(mddev))
>>       md_update_sb
>>
>> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
>>
>> Hence cyclic dependency exist, in order to fix the problem, we must
>> break one of them. Dependency 1 and 2 can't be broken because they are
>> foundation design. Dependency 4 may be possible if it can be guaranteed
>> that no io can be inflight, however, this requires a new mechanism which
>> seems complex. Dependency 3 is a good choice, because idle/frozen only
>> requires sync thread to finish, which can be done asynchronously that is
>> already implemented, and 'reconfig_mutex' is not needed anymore.
>>
>> This patch switch 'idle' and 'frozen' to wait sync thread to be done
>> asynchronously, and this patch also add a sequence counter to record how
>> many times sync thread is done, so that 'idle' won't keep waiting on new
>> started sync thread.
> 
> In the patch, sync_seq is added in md_reap_sync_thread. In 
> idle_sync_thread, if sync_seq isn't equal
> 
> mddev->sync_seq, it should mean there is someone that stops the sync 
> thread already, right? Why do
> 
> you say 'new started sync thread' here?
> 
> Regards
> 
> Xiao
> 
> 
>>
>> Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
>> deadlock can be fixed by this patch as well.
>>
>> [1] 
>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t 
>>
>> [2] 
>> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ 
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/md.c | 23 +++++++++++++++++++----
>>   drivers/md/md.h |  2 ++
>>   2 files changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 63a993b52cd7..7912de0e4d12 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
>>       timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
>>       atomic_set(&mddev->active, 1);
>>       atomic_set(&mddev->openers, 0);
>> +    atomic_set(&mddev->sync_seq, 0);
>>       spin_lock_init(&mddev->lock);
>>       atomic_set(&mddev->flush_pending, 0);
>>       init_waitqueue_head(&mddev->sb_wait);
>> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
>>       if (work_pending(&mddev->del_work))
>>           flush_workqueue(md_misc_wq);
>> -    if (mddev->sync_thread) {
>> -        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> -        md_reap_sync_thread(mddev);
>> -    }
>> +    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> +    /*
>> +     * Thread might be blocked waiting for metadata update which will 
>> now
>> +     * never happen
>> +     */
>> +    md_wakeup_thread_directly(mddev->sync_thread);
>>       mddev_unlock(mddev);
>>   }
>>   static void idle_sync_thread(struct mddev *mddev)
>>   {
>> +    int sync_seq = atomic_read(&mddev->sync_seq);
>> +
>>       mutex_lock(&mddev->sync_mutex);
>>       clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>       stop_sync_thread(mddev);
>> +
>> +    wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>> +
>>       mutex_unlock(&mddev->sync_mutex);
>>   }
>> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev 
>> *mddev)
>>       mutex_init(&mddev->delete_mutex);
>>       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>       stop_sync_thread(mddev);
>> +
>> +    wait_event(resync_wait, mddev->sync_thread == NULL &&
>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>> +
>>       mutex_unlock(&mddev->sync_mutex);
>>   }
>> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
>>       /* resync has finished, collect result */
>>       md_unregister_thread(&mddev->sync_thread);
>> +    atomic_inc(&mddev->sync_seq);
>> +
>>       if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>>           !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>>           mddev->degraded != mddev->raid_disks) {
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 2fa903de5bd0..7cab9c7c45b8 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -539,6 +539,8 @@ struct mddev {
>>       /* Used to synchronize idle and frozen for action_store() */
>>       struct mutex            sync_mutex;
>> +    /* The sequence number for sync thread */
>> +    atomic_t sync_seq;
>>       bool    has_superblocks:1;
>>       bool    fail_last_dev:1;
> 

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/06/13 22:50, Xiao Ni wrote:
> >
> > On 2023/5/29 9:20 PM, Yu Kuai wrote:
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Our test found a following deadlock in raid10:
> >>
> >> 1) Issue a normal write, and such write failed:
> >>
> >>    raid10_end_write_request
> >>     set_bit(R10BIO_WriteError, &r10_bio->state)
> >>     one_write_done
> >>      reschedule_retry
> >>
> >>    // later from md thread
> >>    raid10d
> >>     handle_write_completed
> >>      list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
> >>
> >>    // later from md thread
> >>    raid10d
> >>     if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> >>      list_move(conf->bio_end_io_list.prev, &tmp)
> >>      r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> >>      raid_end_bio_io(r10_bio)
> >>
> >> Dependency chain 1: normal io is waiting for updating superblock
> >
> > Hi Kuai
> >
> > It looks like the above situation is more complex. It only needs a
> > normal write and md_write_start needs to
> >
> > wait until the metadata is written to member disks, right? If so, it
> > doesn't need to introduce raid10 write failure
> >
> > here. I guess, it should be your test case. It's nice, if you can put
> > your test steps in the patch. But for the analysis
> >
> > of the deadlock here, it's better to be simple.
>
> Test script can be found here, it's pretty easy to trigger:
>
> https://patchwork.kernel.org/project/linux-raid/patch/20230529132826.2125392-4-yukuai1@huaweicloud.com/

Thanks for this.
>
> While reviewing the related code, I found that io can only be added to
> list bio_end_io_list from handle_write_completed() if such io failed, so
> I think io failure is needed to trigger deadlock from daemon thread.
>
> I think the key point is how MD_SB_CHANGE_PENDING is set:
>
> 1) raid10_error() and rdev_set_badblocks(), trigger by io failure;
> 2) raid10_write_request() related to reshape;
> 3) md_write_start() and md_allow_write(), and mddev->in_sync is set,
> however, I was thinking this is not a common case;
>
> 1) is used here because it's quite easy to trigger and this is what
> we meet in real test. 3) is possible but I will say let's keep 1), I
> don't think it's necessary to reporduce this deadlock through another
> path again.

It makes sense. Let's go back to the first path mentioned in the patch.

> 1) Issue a normal write, and such write failed:
>
>    raid10_end_write_request
>     set_bit(R10BIO_WriteError, &r10_bio->state)
>     one_write_done
>      reschedule_retry

This is good.
>
>    // later from md thread
>    raid10d
>     handle_write_completed
>      list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

I have a question here. handle_write_completed() should run
narrow_write_error(). In the test case, will narrow_write_error() run
successfully? Or does it fail and call rdev_set_badblocks() and
md_error(), so that MD_RECOVERY_PENDING gets set?

>
>    // later from md thread
>    raid10d
>     if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>      list_move(conf->bio_end_io_list.prev, &tmp)
>      r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>      raid_end_bio_io(r10_bio)
>
> Dependency chain 1: normal io is waiting for updating superblock

It's a little hard to understand, because it doesn't show how normal io
waits for a superblock update. Based on your last email, I guess you
mean that rdev_set_badblocks() sets MD_RECOVERY_PENDING, but the flag
can't be cleared, so the bios can't be added to bio_end_io_list and the
io requests can't be finished.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> >>
> >> 2) Trigger a recovery:
> >>
> >>    raid10_sync_request
> >>     raise_barrier
> >>
> >> Dependency chain 2: sync thread is waiting for normal io
> >>
> >> 3) echo idle/frozen to sync_action:
> >>
> >>    action_store
> >>     mddev_lock
> >>      md_unregister_thread
> >>       kthread_stop
> >>
> >> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
> >>
> >> 4) md thread can't update superblock:
> >>
> >>    raid10d
> >>     md_check_recovery
> >>      if (mddev_trylock(mddev))
> >>       md_update_sb
> >>
> >> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
> >>
> >> Hence cyclic dependency exist, in order to fix the problem, we must
> >> break one of them. Dependency 1 and 2 can't be broken because they are
> >> foundation design. Dependency 4 may be possible if it can be guaranteed
> >> that no io can be inflight, however, this requires a new mechanism which
> >> seems complex. Dependency 3 is a good choice, because idle/frozen only
> >> requires sync thread to finish, which can be done asynchronously that is
> >> already implemented, and 'reconfig_mutex' is not needed anymore.
> >>
> >> This patch switch 'idle' and 'frozen' to wait sync thread to be done
> >> asynchronously, and this patch also add a sequence counter to record how
> >> many times sync thread is done, so that 'idle' won't keep waiting on new
> >> started sync thread.
> >
> > In the patch, sync_seq is added in md_reap_sync_thread. In
> > idle_sync_thread, if sync_seq isn't equal
> >
> > mddev->sync_seq, it should mean there is someone that stops the sync
> > thread already, right? Why do
> >
> > you say 'new started sync thread' here?
> >
> > Regards
> >
> > Xiao
> >
> >
> >>
> >> Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
> >> deadlock can be fixed by this patch as well.
> >>
> >> [1]
> >> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
> >>
> >> [2]
> >> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
> >>
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >> ---
> >>   drivers/md/md.c | 23 +++++++++++++++++++----
> >>   drivers/md/md.h |  2 ++
> >>   2 files changed, 21 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 63a993b52cd7..7912de0e4d12 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
> >>       timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
> >>       atomic_set(&mddev->active, 1);
> >>       atomic_set(&mddev->openers, 0);
> >> +    atomic_set(&mddev->sync_seq, 0);
> >>       spin_lock_init(&mddev->lock);
> >>       atomic_set(&mddev->flush_pending, 0);
> >>       init_waitqueue_head(&mddev->sb_wait);
> >> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
> >>       if (work_pending(&mddev->del_work))
> >>           flush_workqueue(md_misc_wq);
> >> -    if (mddev->sync_thread) {
> >> -        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >> -        md_reap_sync_thread(mddev);
> >> -    }
> >> +    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >> +    /*
> >> +     * Thread might be blocked waiting for metadata update which will
> >> now
> >> +     * never happen
> >> +     */
> >> +    md_wakeup_thread_directly(mddev->sync_thread);
> >>       mddev_unlock(mddev);
> >>   }
> >>   static void idle_sync_thread(struct mddev *mddev)
> >>   {
> >> +    int sync_seq = atomic_read(&mddev->sync_seq);
> >> +
> >>       mutex_lock(&mddev->sync_mutex);
> >>       clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >>       stop_sync_thread(mddev);
> >> +
> >> +    wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
> >> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >> +
> >>       mutex_unlock(&mddev->sync_mutex);
> >>   }
> >> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
> >> *mddev)
> >>       mutex_init(&mddev->delete_mutex);
> >>       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >>       stop_sync_thread(mddev);
> >> +
> >> +    wait_event(resync_wait, mddev->sync_thread == NULL &&
> >> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >> +
> >>       mutex_unlock(&mddev->sync_mutex);
> >>   }
> >> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
> >>       /* resync has finished, collect result */
> >>       md_unregister_thread(&mddev->sync_thread);
> >> +    atomic_inc(&mddev->sync_seq);
> >> +
> >>       if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> >>           !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
> >>           mddev->degraded != mddev->raid_disks) {
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index 2fa903de5bd0..7cab9c7c45b8 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -539,6 +539,8 @@ struct mddev {
> >>       /* Used to synchronize idle and frozen for action_store() */
> >>       struct mutex            sync_mutex;
> >> +    /* The sequence number for sync thread */
> >> +    atomic_t sync_seq;
> >>       bool    has_superblocks:1;
> >>       bool    fail_last_dev:1;
> >
>
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
Hi,

On 2023/06/14 11:47, Xiao Ni wrote:
> On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2023/06/13 22:50, Xiao Ni wrote:
>>>
>>> On 2023/5/29 9:20 PM, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Our test found a following deadlock in raid10:
>>>>
>>>> 1) Issue a normal write, and such write failed:
>>>>
>>>>     raid10_end_write_request
>>>>      set_bit(R10BIO_WriteError, &r10_bio->state)
>>>>      one_write_done
>>>>       reschedule_retry
>>>>
>>>>     // later from md thread
>>>>     raid10d
>>>>      handle_write_completed
>>>>       list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
>>>>
>>>>     // later from md thread
>>>>     raid10d
>>>>      if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>>>       list_move(conf->bio_end_io_list.prev, &tmp)
>>>>       r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>>>>       raid_end_bio_io(r10_bio)
>>>>
>>>> Dependency chain 1: normal io is waiting for updating superblock
>>>
>>> Hi Kuai
>>>
>>> It looks like the above situation is more complex. It only needs a
>>> normal write and md_write_start needs to
>>>
>>> wait until the metadata is written to member disks, right? If so, it
>>> doesn't need to introduce raid10 write failure
>>>
>>> here. I guess, it should be your test case. It's nice, if you can put
>>> your test steps in the patch. But for the analysis
>>>
>>> of the deadlock here, it's better to be simple.
>>
>> Test script can be found here, it's pretty easy to trigger:
>>
>> https://patchwork.kernel.org/project/linux-raid/patch/20230529132826.2125392-4-yukuai1@huaweicloud.com/
> 
> Thanks for this.
>>
>> While reviewing the related code, I found that io can only be added to
>> list bio_end_io_list from handle_write_completed() if such io failed, so
>> I think io failure is needed to trigger deadlock from daemon thread.
>>
>> I think the key point is how MD_SB_CHANGE_PENDING is set:
>>
>> 1) raid10_error() and rdev_set_badblocks(), trigger by io failure;
>> 2) raid10_write_request() related to reshape;
>> 3) md_write_start() and md_allow_write(), and mddev->in_sync is set,
>> however, I was thinking this is not a common case;
>>
>> 1) is used here because it's quite easy to trigger and this is what
>> we meet in real test. 3) is possible but I will say let's keep 1), I
>> don't think it's necessary to reporduce this deadlock through another
>> path again.
> 
> It makes sense. Let's go back to the first path mentioned in the patch.
> 
>> 1) Issue a normal write, and such write failed:
>>
>>     raid10_end_write_request
>>      set_bit(R10BIO_WriteError, &r10_bio->state)
>>      one_write_done
>>       reschedule_retry
> 
> This is good.
>>
>>     // later from md thread
>>     raid10d
>>      handle_write_completed
>>       list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
> 
> I have a question here. It should run narrow_write_error in
> handle_write_completed. In the test case, will narrow_write_error run
> successfully? Or it fails and will call rdev_set_badblocks and
> md_error. So MD_RECOVERY_PENDING will be set?

r10_bio will always be added to bio_end_io_list, no matter whether
narrow_write_error() succeeds or not. Dependency chain 1 here just
indicates that handling this r10_bio will wait for the superblock
update; it's not where MD_RECOVERY_PENDING is set...

And MD_RECOVERY_PENDING can be set from narrow_write_error() and other
places where rdev_set_badblocks() is called.
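
Roughly, the behaviour above can be sketched like this (a simplification
of the failed-write handling in handle_write_completed(), not the
literal raid10.c code; the helper name is made up):

/* Sketch only: per-device bookkeeping and locking omitted. */
static void park_failed_write_sketch(struct r10conf *conf,
				     struct r10bio *r10_bio, int slot)
{
	/*
	 * The superblock may get marked dirty on this path, either by
	 * rdev_set_badblocks() from inside narrow_write_error() or by
	 * md_error()/raid10_error() when the retry fails entirely.
	 */
	if (!narrow_write_error(r10_bio, slot))
		md_error(conf->mddev, conf->mirrors[slot].rdev);

	/*
	 * In both cases the r10_bio is parked here; raid_end_bio_io() only
	 * runs later from raid10d(), once MD_SB_CHANGE_PENDING is clear.
	 */
	list_add(&r10_bio->retry_list, &conf->bio_end_io_list);
}
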
> 
>>
>>     // later from md thread
>>     raid10d
>>      if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
	-> It's here: if the flag is set, the bio won't be handled.
>>       list_move(conf->bio_end_io_list.prev, &tmp)
>>       r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>>       raid_end_bio_io(r10_bio)
>>
>> Dependency chain 1: normal io is waiting for updating superblock
> 
> It's a little hard to understand. Because it doesn't show how normal
> io waits for a superblock update. And based on your last email, I
> guess you want to say rdev_set_badblock sets MD_RECOVERY_PENDING, but
> the flag can't be cleared, so the bios can't be added to
> bio_end_io_list, so the io rquests can't be finished.

It's not that bios can't be added to bio_end_io_list; it's that bios on
this list can't be handled while the flag is set in sb_flags.

Thanks,
Kuai
> 
> Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>>
>>>>
>>>> 2) Trigger a recovery:
>>>>
>>>>     raid10_sync_request
>>>>      raise_barrier
>>>>
>>>> Dependency chain 2: sync thread is waiting for normal io
>>>>
>>>> 3) echo idle/frozen to sync_action:
>>>>
>>>>     action_store
>>>>      mddev_lock
>>>>       md_unregister_thread
>>>>        kthread_stop
>>>>
>>>> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
>>>>
>>>> 4) md thread can't update superblock:
>>>>
>>>>     raid10d
>>>>      md_check_recovery
>>>>       if (mddev_trylock(mddev))
>>>>        md_update_sb
>>>>
>>>> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
>>>>
>>>> Hence cyclic dependency exist, in order to fix the problem, we must
>>>> break one of them. Dependency 1 and 2 can't be broken because they are
>>>> foundation design. Dependency 4 may be possible if it can be guaranteed
>>>> that no io can be inflight, however, this requires a new mechanism which
>>>> seems complex. Dependency 3 is a good choice, because idle/frozen only
>>>> requires sync thread to finish, which can be done asynchronously that is
>>>> already implemented, and 'reconfig_mutex' is not needed anymore.
>>>>
>>>> This patch switch 'idle' and 'frozen' to wait sync thread to be done
>>>> asynchronously, and this patch also add a sequence counter to record how
>>>> many times sync thread is done, so that 'idle' won't keep waiting on new
>>>> started sync thread.
>>>
>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>> idle_sync_thread, if sync_seq isn't equal
>>>
>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>> thread already, right? Why do
>>>
>>> you say 'new started sync thread' here?
>>>
>>> Regards
>>>
>>> Xiao
>>>
>>>
>>>>
>>>> Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
>>>> deadlock can be fixed by this patch as well.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
>>>>
>>>> [2]
>>>> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>>    drivers/md/md.c | 23 +++++++++++++++++++----
>>>>    drivers/md/md.h |  2 ++
>>>>    2 files changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 63a993b52cd7..7912de0e4d12 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
>>>>        timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
>>>>        atomic_set(&mddev->active, 1);
>>>>        atomic_set(&mddev->openers, 0);
>>>> +    atomic_set(&mddev->sync_seq, 0);
>>>>        spin_lock_init(&mddev->lock);
>>>>        atomic_set(&mddev->flush_pending, 0);
>>>>        init_waitqueue_head(&mddev->sb_wait);
>>>> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
>>>>        if (work_pending(&mddev->del_work))
>>>>            flush_workqueue(md_misc_wq);
>>>> -    if (mddev->sync_thread) {
>>>> -        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>> -        md_reap_sync_thread(mddev);
>>>> -    }
>>>> +    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>> +    /*
>>>> +     * Thread might be blocked waiting for metadata update which will
>>>> now
>>>> +     * never happen
>>>> +     */
>>>> +    md_wakeup_thread_directly(mddev->sync_thread);
>>>>        mddev_unlock(mddev);
>>>>    }
>>>>    static void idle_sync_thread(struct mddev *mddev)
>>>>    {
>>>> +    int sync_seq = atomic_read(&mddev->sync_seq);
>>>> +
>>>>        mutex_lock(&mddev->sync_mutex);
>>>>        clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>>        stop_sync_thread(mddev);
>>>> +
>>>> +    wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
>>>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>>>> +
>>>>        mutex_unlock(&mddev->sync_mutex);
>>>>    }
>>>> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
>>>> *mddev)
>>>>        mutex_init(&mddev->delete_mutex);
>>>>        set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>>        stop_sync_thread(mddev);
>>>> +
>>>> +    wait_event(resync_wait, mddev->sync_thread == NULL &&
>>>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>>>> +
>>>>        mutex_unlock(&mddev->sync_mutex);
>>>>    }
>>>> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
>>>>        /* resync has finished, collect result */
>>>>        md_unregister_thread(&mddev->sync_thread);
>>>> +    atomic_inc(&mddev->sync_seq);
>>>> +
>>>>        if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>>>>            !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>>>>            mddev->degraded != mddev->raid_disks) {
>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>> index 2fa903de5bd0..7cab9c7c45b8 100644
>>>> --- a/drivers/md/md.h
>>>> +++ b/drivers/md/md.h
>>>> @@ -539,6 +539,8 @@ struct mddev {
>>>>        /* Used to synchronize idle and frozen for action_store() */
>>>>        struct mutex            sync_mutex;
>>>> +    /* The sequence number for sync thread */
>>>> +    atomic_t sync_seq;
>>>>        bool    has_superblocks:1;
>>>>        bool    fail_last_dev:1;
>>>
>>
> 
> .
> 

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Wed, Jun 14, 2023 at 2:05 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/06/14 11:47, Xiao Ni wrote:
> > On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/06/13 22:50, Xiao Ni wrote:
> >>>
> >>> On 2023/5/29 9:20 PM, Yu Kuai wrote:
> >>>> From: Yu Kuai <yukuai3@huawei.com>
> >>>>
> >>>> Our test found a following deadlock in raid10:
> >>>>
> >>>> 1) Issue a normal write, and such write failed:
> >>>>
> >>>>     raid10_end_write_request
> >>>>      set_bit(R10BIO_WriteError, &r10_bio->state)
> >>>>      one_write_done
> >>>>       reschedule_retry
> >>>>
> >>>>     // later from md thread
> >>>>     raid10d
> >>>>      handle_write_completed
> >>>>       list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
> >>>>
> >>>>     // later from md thread
> >>>>     raid10d
> >>>>      if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> >>>>       list_move(conf->bio_end_io_list.prev, &tmp)
> >>>>       r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> >>>>       raid_end_bio_io(r10_bio)
> >>>>
> >>>> Dependency chain 1: normal io is waiting for updating superblock
> >>>
> >>> Hi Kuai
> >>>
> >>> It looks like the above situation is more complex. It only needs a
> >>> normal write and md_write_start needs to
> >>>
> >>> wait until the metadata is written to member disks, right? If so, it
> >>> doesn't need to introduce raid10 write failure
> >>>
> >>> here. I guess, it should be your test case. It's nice, if you can put
> >>> your test steps in the patch. But for the analysis
> >>>
> >>> of the deadlock here, it's better to be simple.
> >>
> >> Test script can be found here, it's pretty easy to trigger:
> >>
> >> https://patchwork.kernel.org/project/linux-raid/patch/20230529132826.2125392-4-yukuai1@huaweicloud.com/
> >
> > Thanks for this.
> >>
> >> While reviewing the related code, I found that io can only be added to
> >> list bio_end_io_list from handle_write_completed() if such io failed, so
> >> I think io failure is needed to trigger deadlock from daemon thread.
> >>
> >> I think the key point is how MD_SB_CHANGE_PENDING is set:
> >>
> >> 1) raid10_error() and rdev_set_badblocks(), trigger by io failure;
> >> 2) raid10_write_request() related to reshape;
> >> 3) md_write_start() and md_allow_write(), and mddev->in_sync is set,
> >> however, I was thinking this is not a common case;
> >>
> >> 1) is used here because it's quite easy to trigger and this is what
> >> we meet in real test. 3) is possible but I will say let's keep 1), I
> >> don't think it's necessary to reporduce this deadlock through another
> >> path again.
> >
> > It makes sense. Let's go back to the first path mentioned in the patch.
> >
> >> 1) Issue a normal write, and such write failed:
> >>
> >>     raid10_end_write_request
> >>      set_bit(R10BIO_WriteError, &r10_bio->state)
> >>      one_write_done
> >>       reschedule_retry
> >
> > This is good.
> >>
> >>     // later from md thread
> >>     raid10d
> >>      handle_write_completed
> >>       list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
> >
> > I have a question here. It should run narrow_write_error in
> > handle_write_completed. In the test case, will narrow_write_error run
> > successfully? Or it fails and will call rdev_set_badblocks and
> > md_error. So MD_RECOVERY_PENDING will be set?
>
> r10_bio will always be added to bio_end_io_list, no matter
> narrow_write_error() succeed or not. The dependecy chain 1 here is just
> indicate handle this r10_bio will wait for updating super block, it's
> not where MD_RECOVERY_PENDING is set...
>
> And MD_RECOVERY_PENDING can be set from narrow_write_error() and other
> places where rdev_set_badblocks() is called.

Because your patch doesn't show which step sets MD_RECOVERY_PENDING,
that's the reason I had to guess. It's a normal write, so
md_write_start() can set the flag, and in this case it can cause the
same deadlock. So it's better to state which step sets the flag.
> >
> >>
> >>     // later from md thread
> >>     raid10d
> >>      if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>         -> It's here, if the flag is set, bio won't be handled.

Yes.

> >>       list_move(conf->bio_end_io_list.prev, &tmp)
> >>       r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> >>       raid_end_bio_io(r10_bio)
> >>
> >> Dependency chain 1: normal io is waiting for updating superblock
> >
> > It's a little hard to understand. Because it doesn't show how normal
> > io waits for a superblock update. And based on your last email, I
> > guess you want to say rdev_set_badblock sets MD_RECOVERY_PENDING, but
> > the flag can't be cleared, so the bios can't be added to
> > bio_end_io_list, so the io rquests can't be finished.
>
> It's not that bio can't be added to bio_end_io_list, it's that bio in
> this list can't be handled if sb_flags is set.

Sorry for the confusion, I meant to say the same thing. Now I understand the case completely.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>>>
> >>>> 2) Trigger a recovery:
> >>>>
> >>>>     raid10_sync_request
> >>>>      raise_barrier
> >>>>
> >>>> Dependency chain 2: sync thread is waiting for normal io
> >>>>
> >>>> 3) echo idle/frozen to sync_action:
> >>>>
> >>>>     action_store
> >>>>      mddev_lock
> >>>>       md_unregister_thread
> >>>>        kthread_stop
> >>>>
> >>>> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
> >>>>
> >>>> 4) md thread can't update superblock:
> >>>>
> >>>>     raid10d
> >>>>      md_check_recovery
> >>>>       if (mddev_trylock(mddev))
> >>>>        md_update_sb
> >>>>
> >>>> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
> >>>>
> >>>> Hence cyclic dependency exist, in order to fix the problem, we must
> >>>> break one of them. Dependency 1 and 2 can't be broken because they are
> >>>> foundation design. Dependency 4 may be possible if it can be guaranteed
> >>>> that no io can be inflight, however, this requires a new mechanism which
> >>>> seems complex. Dependency 3 is a good choice, because idle/frozen only
> >>>> requires sync thread to finish, which can be done asynchronously that is
> >>>> already implemented, and 'reconfig_mutex' is not needed anymore.
> >>>>
> >>>> This patch switch 'idle' and 'frozen' to wait sync thread to be done
> >>>> asynchronously, and this patch also add a sequence counter to record how
> >>>> many times sync thread is done, so that 'idle' won't keep waiting on new
> >>>> started sync thread.
> >>>
> >>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>> idle_sync_thread, if sync_seq isn't equal
> >>>
> >>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>> thread already, right? Why do
> >>>
> >>> you say 'new started sync thread' here?
> >>>
> >>> Regards
> >>>
> >>> Xiao
> >>>
> >>>
> >>>>
> >>>> Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
> >>>> deadlock can be fixed by this patch as well.
> >>>>
> >>>> [1]
> >>>> https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
> >>>>
> >>>> [2]
> >>>> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
> >>>>
> >>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >>>> ---
> >>>>    drivers/md/md.c | 23 +++++++++++++++++++----
> >>>>    drivers/md/md.h |  2 ++
> >>>>    2 files changed, 21 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>> index 63a993b52cd7..7912de0e4d12 100644
> >>>> --- a/drivers/md/md.c
> >>>> +++ b/drivers/md/md.c
> >>>> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
> >>>>        timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
> >>>>        atomic_set(&mddev->active, 1);
> >>>>        atomic_set(&mddev->openers, 0);
> >>>> +    atomic_set(&mddev->sync_seq, 0);
> >>>>        spin_lock_init(&mddev->lock);
> >>>>        atomic_set(&mddev->flush_pending, 0);
> >>>>        init_waitqueue_head(&mddev->sb_wait);
> >>>> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
> >>>>        if (work_pending(&mddev->del_work))
> >>>>            flush_workqueue(md_misc_wq);
> >>>> -    if (mddev->sync_thread) {
> >>>> -        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >>>> -        md_reap_sync_thread(mddev);
> >>>> -    }
> >>>> +    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >>>> +    /*
> >>>> +     * Thread might be blocked waiting for metadata update which will
> >>>> now
> >>>> +     * never happen
> >>>> +     */
> >>>> +    md_wakeup_thread_directly(mddev->sync_thread);
> >>>>        mddev_unlock(mddev);
> >>>>    }
> >>>>    static void idle_sync_thread(struct mddev *mddev)
> >>>>    {
> >>>> +    int sync_seq = atomic_read(&mddev->sync_seq);
> >>>> +
> >>>>        mutex_lock(&mddev->sync_mutex);
> >>>>        clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >>>>        stop_sync_thread(mddev);
> >>>> +
> >>>> +    wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
> >>>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >>>> +
> >>>>        mutex_unlock(&mddev->sync_mutex);
> >>>>    }
> >>>> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
> >>>> *mddev)
> >>>>        mutex_init(&mddev->delete_mutex);
> >>>>        set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >>>>        stop_sync_thread(mddev);
> >>>> +
> >>>> +    wait_event(resync_wait, mddev->sync_thread == NULL &&
> >>>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >>>> +
> >>>>        mutex_unlock(&mddev->sync_mutex);
> >>>>    }
> >>>> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
> >>>>        /* resync has finished, collect result */
> >>>>        md_unregister_thread(&mddev->sync_thread);
> >>>> +    atomic_inc(&mddev->sync_seq);
> >>>> +
> >>>>        if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> >>>>            !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
> >>>>            mddev->degraded != mddev->raid_disks) {
> >>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >>>> index 2fa903de5bd0..7cab9c7c45b8 100644
> >>>> --- a/drivers/md/md.h
> >>>> +++ b/drivers/md/md.h
> >>>> @@ -539,6 +539,8 @@ struct mddev {
> >>>>        /* Used to synchronize idle and frozen for action_store() */
> >>>>        struct mutex            sync_mutex;
> >>>> +    /* The sequence number for sync thread */
> >>>> +    atomic_t sync_seq;
> >>>>        bool    has_superblocks:1;
> >>>>        bool    fail_last_dev:1;
> >>>
> >>
> >
> > .
> >
>
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
Hi,

On 2023/06/14 9:48, Yu Kuai wrote:


>>
>> In the patch, sync_seq is added in md_reap_sync_thread. In 
>> idle_sync_thread, if sync_seq isn't equal
>>
>> mddev->sync_seq, it should mean there is someone that stops the sync 
>> thread already, right? Why do
>>
>> you say 'new started sync thread' here?

If someone stops the sync thread and a new sync thread is not started,
then sync_seq won't make a difference; the above wait_event() will not
wait, because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
So 'sync_seq' only matters when the old sync thread stops and a new sync
thread starts; adding 'sync_seq' lets 'idle' avoid waiting on that new
sync thread.
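
To make this concrete, here is the 'idle' wait from the patch again as a
condensed sketch, with the two exit conditions annotated (annotations
added for illustration; the real code is in the hunk of this patch):

static void idle_sync_thread_sketch(struct mddev *mddev)
{
	/* Snapshot taken before the running sync thread is interrupted. */
	int sync_seq = atomic_read(&mddev->sync_seq);

	mutex_lock(&mddev->sync_mutex);
	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
	/* Sets MD_RECOVERY_INTR, wakes the sync thread, drops reconfig_mutex. */
	stop_sync_thread(mddev);

	wait_event(resync_wait,
		   /* the sync thread we interrupted has been reaped ... */
		   sync_seq != atomic_read(&mddev->sync_seq) ||
		   /* ... or no sync thread was running in the first place */
		   !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));

	mutex_unlock(&mddev->sync_mutex);
}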

Thanks,
Kuai

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/06/14 9:48, Yu Kuai wrote:
>
>
> >>
> >> In the patch, sync_seq is added in md_reap_sync_thread. In
> >> idle_sync_thread, if sync_seq isn't equal
> >>
> >> mddev->sync_seq, it should mean there is someone that stops the sync
> >> thread already, right? Why do
> >>
> >> you say 'new started sync thread' here?
>
> If someone stops the sync thread, and new sync thread is not started,
> then this sync_seq won't make a difference, above wait_event() will not
> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> So 'sync_seq' is only used when the old sync thread stops and new sync
> thread starts, add 'sync_seq' will bypass this case.

Hi

If a new sync thread starts, why would sync_seq be different? sync_seq
is only incremented in md_reap_sync_thread(). And when a new sync
request starts, can't the sync request be stopped again?

At first, sync_seq is 0.

admin1
echo idle > sync_action
idle_sync_thread (sync_seq is now 1)
echo resync > sync_action (new sync)

Then admin2 echoes idle > sync_action, and sync_seq is still 1.

Regards
Xiao

>
> Thanks,
> Kuai
>
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
Hi,

On 2023/06/14 15:12, Xiao Ni wrote:
> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2023/06/14 9:48, Yu Kuai wrote:
>>
>>
>>>>
>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>>> idle_sync_thread, if sync_seq isn't equal
>>>>
>>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>>> thread already, right? Why do
>>>>
>>>> you say 'new started sync thread' here?
>>
>> If someone stops the sync thread, and new sync thread is not started,
>> then this sync_seq won't make a difference, above wait_event() will not
>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
>> So 'sync_seq' is only used when the old sync thread stops and new sync
>> thread starts, add 'sync_seq' will bypass this case.
> 
> Hi
> 
> If a new sync thread starts, why can sync_seq be different? sync_seq
> is only added in md_reap_sync_thread. And when a new sync request
> starts, it can't stop the sync request again?
> 
> Af first, the sync_seq is 0
> 
> admin1
> echo idle > sync_action
> idle_sync_thread(sync_seq is 1)

Wait, I'm confused here: how can sync_seq be 1 here? I suppose you mean
that a sync_thread has just finished?

Then the question is: idle_sync_thread() reads sync_seq after the old
sync_thread is done, and a new sync_thread starts before wait_event() is
called; should we wait for this new sync_thread?

My answer is that we should, but I'm also ok with not waiting for this
new sync_thread; I don't think this behaviour matters. The key point is
that once wait_event() is called from idle_sync_thread(), it should not
wait for a new sync_thread...

> echo resync > sync_action (new sync)

If this comes after "echo idle > sync_action", idle_sync_thread() should
not see MD_RECOVERY_RUNNING set, and wait_event() won't wait at all.

Thanks,
Kuai
> 
> Then admin2 echos idle > sync_action, sync_seq is still 1
> 
> Regards
> Xiao
> 
>>
>> Thanks,
>> Kuai
>>
> 
> .
> 

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/06/14 15:12, Xiao Ni wrote:
> > On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/06/14 9:48, Yu Kuai wrote:
> >>
> >>
> >>>>
> >>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>>> idle_sync_thread, if sync_seq isn't equal
> >>>>
> >>>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>>> thread already, right? Why do
> >>>>
> >>>> you say 'new started sync thread' here?
> >>
> >> If someone stops the sync thread, and new sync thread is not started,
> >> then this sync_seq won't make a difference, above wait_event() will not
> >> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> >> So 'sync_seq' is only used when the old sync thread stops and new sync
> >> thread starts, add 'sync_seq' will bypass this case.
> >
> > Hi
> >
> > If a new sync thread starts, why can sync_seq be different? sync_seq
> > is only added in md_reap_sync_thread. And when a new sync request
> > starts, it can't stop the sync request again?
> >
> > Af first, the sync_seq is 0
> >
> > admin1
> > echo idle > sync_action
> > idle_sync_thread(sync_seq is 1)
>
> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> mean that there is a sync_thread just finished?

Hi Kuai

Yes. Because idle_sync_thread() needs to wait until md_reap_sync_thread()
finishes, and md_reap_sync_thread() increments sync_seq. Do I understand
your patch correctly?

>
> Then the problem is that idle_sync_thread() read sync_seq after the old
> sync_thread is done, and new sync_thread start before wait_event() is
> called, should we wait for this new sync_thread?
>
> My answer here is that we should, but I'm also ok to not wait this new
> sync_thread, I don't think this behaviour matters. The key point here
> is that once wait_event() is called from idle_sync_thread(), this
> wait_event() should not wait for new sync_thread...

I think we should wait. If we don't wait for it, there is a problem:
someone echoes idle to sync_action and sometimes it doesn't take effect.
That would be strange.

>
> > echo resync > sync_action (new sync)
>
> If this is behind "echo idle > sync_action", idle_sync_thread should not
> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.

`echo resync > sync_action` doesn't change sync_seq. So will 'echo idle >
sync_action' still wait until MD_RECOVERY_RUNNING is cleared?

Regards
Xiao

>
> Thanks,
> Kuai
> >
> > Then admin2 echos idle > sync_action, sync_seq is still 1
> >
> > Regards
> > Xiao
> >
> >>
> >> Thanks,
> >> Kuai
> >>
> >
> > .
> >
>
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
Hi,

On 2023/06/14 15:57, Xiao Ni wrote:
> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2023/06/14 15:12, Xiao Ni wrote:
>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2023/06/14 9:48, Yu Kuai wrote:
>>>>
>>>>
>>>>>>
>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>>>>> idle_sync_thread, if sync_seq isn't equal
>>>>>>
>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>>>>> thread already, right? Why do
>>>>>>
>>>>>> you say 'new started sync thread' here?
>>>>
>>>> If someone stops the sync thread, and new sync thread is not started,
>>>> then this sync_seq won't make a difference, above wait_event() will not
>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
>>>> thread starts, add 'sync_seq' will bypass this case.
>>>
>>> Hi
>>>
>>> If a new sync thread starts, why can sync_seq be different? sync_seq
>>> is only added in md_reap_sync_thread. And when a new sync request
>>> starts, it can't stop the sync request again?
>>>
>>> Af first, the sync_seq is 0
>>>
>>> admin1
>>> echo idle > sync_action
>>> idle_sync_thread(sync_seq is 1)
>>
>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
>> mean that there is a sync_thread just finished?
> 
> Hi Kuai
> 
> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> patch right?

Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
is set.

> 
>>
>> Then the problem is that idle_sync_thread() read sync_seq after the old
>> sync_thread is done, and new sync_thread start before wait_event() is
>> called, should we wait for this new sync_thread?
>>
>> My answer here is that we should, but I'm also ok to not wait this new
>> sync_thread, I don't think this behaviour matters. The key point here
>> is that once wait_event() is called from idle_sync_thread(), this
>> wait_event() should not wait for new sync_thread...
> 
> I think we should wait. If we don't wait for it, there is a problem.
> One person echos idle to sync_action and it doesn't work sometimes.
> It's a strange thing.
> 

Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
for new sync_thread that is started after wait_event().

>>
>>> echo resync > sync_action (new sync)
>>
>> If this is behind "echo idle > sync_action", idle_sync_thread should not
>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
> 
> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?

This is not accurate: if `echo resync > sync_action` triggers a new
sync_thread, then sync_seq is updated when this sync_thread is done.
During this period, MD_RECOVERY_RUNNING is still set, so `echo idle >
sync_action` will wait for the sync_thread to be done.
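
Roughly, what the patch does here (a simplified sketch, most details
omitted):

static void md_reap_sync_thread(struct mddev *mddev)
{
        /* ... reap the finished mddev->sync_thread ... */
        atomic_inc(&mddev->sync_seq);   /* one more sync_thread is done */
        clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
        wake_up(&resync_wait);          /* let idle_sync_thread() recheck */
}

static void idle_sync_thread(struct mddev *mddev)
{
        int sync_seq = atomic_read(&mddev->sync_seq);   /* snapshot first */

        stop_sync_thread(mddev);        /* interrupt the running sync_thread */

        /*
         * Return once no sync_thread is running, or once the sync_thread
         * that was running at the snapshot has been reaped (sync_seq
         * advanced), even if another sync_thread has started since.
         */
        wait_event(resync_wait,
                   sync_seq != atomic_read(&mddev->sync_seq) ||
                   !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
}

So while the resync triggered by `echo resync` is running,
MD_RECOVERY_RUNNING is set and sync_seq has not advanced yet, hence the
wait_event() above keeps waiting.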

Thanks,
Kuai
> 
> Regards
> Xiao
> 
>>
>> Thanks,
>> Kuai
>>>
>>> Then admin2 echos idle > sync_action, sync_seq is still 1
>>>
>>> Regards
>>> Xiao
>>>
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>
>>> .
>>>
>>
> 
> .
> 

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/06/14 15:57, Xiao Ni wrote:
> > On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/06/14 15:12, Xiao Ni wrote:
> >>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 2023/06/14 9:48, Yu Kuai wrote:
> >>>>
> >>>>
> >>>>>>
> >>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>>>>> idle_sync_thread, if sync_seq isn't equal
> >>>>>>
> >>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>>>>> thread already, right? Why do
> >>>>>>
> >>>>>> you say 'new started sync thread' here?
> >>>>
> >>>> If someone stops the sync thread, and new sync thread is not started,
> >>>> then this sync_seq won't make a difference, above wait_event() will not
> >>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> >>>> So 'sync_seq' is only used when the old sync thread stops and new sync
> >>>> thread starts, add 'sync_seq' will bypass this case.
> >>>
> >>> Hi
> >>>
> >>> If a new sync thread starts, why can sync_seq be different? sync_seq
> >>> is only added in md_reap_sync_thread. And when a new sync request
> >>> starts, it can't stop the sync request again?
> >>>
> >>> Af first, the sync_seq is 0
> >>>
> >>> admin1
> >>> echo idle > sync_action
> >>> idle_sync_thread(sync_seq is 1)
> >>
> >> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> >> mean that there is a sync_thread just finished?
> >
> > Hi Kuai
> >
> > Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> > finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> > patch right?
>
> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
> is set.
>
> >
> >>
> >> Then the problem is that idle_sync_thread() read sync_seq after the old
> >> sync_thread is done, and new sync_thread start before wait_event() is
> >> called, should we wait for this new sync_thread?
> >>
> >> My answer here is that we should, but I'm also ok to not wait this new
> >> sync_thread, I don't think this behaviour matters. The key point here
> >> is that once wait_event() is called from idle_sync_thread(), this
> >> wait_event() should not wait for new sync_thread...
> >
> > I think we should wait. If we don't wait for it, there is a problem.
> > One person echos idle to sync_action and it doesn't work sometimes.
> > It's a strange thing.
> >
>
> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
> for new sync_thread that is started after wait_event().

I suggest removing this function. Without this change, it's simpler
and it works well without problems. The people that echo idle to
sync_action need to wait until the sync action finishes. The code
semantics are clear and simple.
>
> >>
> >>> echo resync > sync_action (new sync)
> >>
> >> If this is behind "echo idle > sync_action", idle_sync_thread should not
> >> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
> >
> > `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> > sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
>
> This is not accurate: if `echo resync > sync_action` triggers a new
> sync_thread, then sync_seq is updated when this sync_thread is done.
> During this period, MD_RECOVERY_RUNNING is still set, so `echo idle >
> sync_action` will wait for the sync_thread to be done.

I can understand your comment, but sorry, I still can't get how
sync_seq works. Could you give a specific case that explains how it
works?

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Regards
> > Xiao
> >
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>> Then admin2 echos idle > sync_action, sync_seq is still 1
> >>>
> >>> Regards
> >>> Xiao
> >>>
> >>>>
> >>>> Thanks,
> >>>> Kuai
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
Hi,

On 2023/06/14 17:08, Xiao Ni wrote:
> On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2023/06/14 15:57, Xiao Ni wrote:
>>> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2023/06/14 15:12, Xiao Ni wrote:
>>>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 2023/06/14 9:48, Yu Kuai wrote:
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>>>>>>> idle_sync_thread, if sync_seq isn't equal
>>>>>>>>
>>>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>>>>>>> thread already, right? Why do
>>>>>>>>
>>>>>>>> you say 'new started sync thread' here?
>>>>>>
>>>>>> If someone stops the sync thread, and new sync thread is not started,
>>>>>> then this sync_seq won't make a difference, above wait_event() will not
>>>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
>>>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
>>>>>> thread starts, add 'sync_seq' will bypass this case.
>>>>>
>>>>> Hi
>>>>>
>>>>> If a new sync thread starts, why can sync_seq be different? sync_seq
>>>>> is only added in md_reap_sync_thread. And when a new sync request
>>>>> starts, it can't stop the sync request again?
>>>>>
>>>>> Af first, the sync_seq is 0
>>>>>
>>>>> admin1
>>>>> echo idle > sync_action
>>>>> idle_sync_thread(sync_seq is 1)
>>>>
>>>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
>>>> mean that there is a sync_thread just finished?
>>>
>>> Hi Kuai
>>>
>>> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
>>> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
>>> patch right?
>>
>> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
>> is set.
>>
>>>
>>>>
>>>> Then the problem is that idle_sync_thread() read sync_seq after the old
>>>> sync_thread is done, and new sync_thread start before wait_event() is
>>>> called, should we wait for this new sync_thread?
>>>>
>>>> My answer here is that we should, but I'm also ok to not wait this new
>>>> sync_thread, I don't think this behaviour matters. The key point here
>>>> is that once wait_event() is called from idle_sync_thread(), this
>>>> wait_event() should not wait for new sync_thread...
>>>
>>> I think we should wait. If we don't wait for it, there is a problem.
>>> One person echos idle to sync_action and it doesn't work sometimes.
>>> It's a strange thing.
>>>
>>
>> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
>> for new sync_thread that is started after wait_event().
> 
> I suggest removing this function. Without this change, it's simpler
> and it works well without problems. The people that echo idle to
> sync_action need to wait until the sync action finishes. The code
> semantics are clear and simple.
>>
>>>>
>>>>> echo resync > sync_action (new sync)
>>>>
>>>> If this is behind "echo idle > sync_action", idle_sync_thread should not
>>>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
>>>
>>> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
>>> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
>>
>> This is not accurate: if `echo resync > sync_action` triggers a new
>> sync_thread, then sync_seq is updated when this sync_thread is done.
>> During this period, MD_RECOVERY_RUNNING is still set, so `echo idle >
>> sync_action` will wait for the sync_thread to be done.
> 
> I can understand your comment, but sorry, I still can't get how
> sync_seq works. Could you give a specific case that explains how it
> works?

Ok, the problem is that echo idle is supposed to interrupt the
sync_thread and stop it quickly. Now that we don't hold the mutex here,
we can't prevent a new sync_thread from starting. For example:

1) a sync_thread A is running, MD_RECOVERY_RUNNING is set;

2) echo idle, A will be interrupted, the mutex is not held and
idle_sync_thread() is waiting for MD_RECOVERY_RUNNING to be cleared.

3) A is interrupted, it'll clear MD_RECOVERY_RUNNING and try to wake up
idle_sync_thread(), however, before idle_sync_thread() is woken, A can
be done and a new sync_thread B can be started, and MD_RECOVERY_RUNNING
will be set again.

4) idle_sync_thread() finally wakes up, however, MD_RECOVERY_RUNNING is
set again and it will keep waiting. And this time B won't be interrupted.
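
To make step 4) concrete, here is the interleaving with the sync_seq
snapshot taken into account (illustrative only, not the exact patch
code):

  idle_sync_thread()                        md thread
  ------------------                        ---------
  seq = atomic_read(&mddev->sync_seq);      sync_thread A running,
  stop_sync_thread(mddev);                  MD_RECOVERY_RUNNING is set
  wait_event(resync_wait, ...);
                                            reap A: sync_seq++,
                                            clear MD_RECOVERY_RUNNING
                                            start B: set
                                            MD_RECOVERY_RUNNING again
  woken: MD_RECOVERY_RUNNING is set again,
  but atomic_read(&mddev->sync_seq) != seq,
  so the wait condition is true and
  wait_event() returns instead of waiting
  for B.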

Thanks,
Kuai

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Thu, Jun 15, 2023 at 9:29 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/06/14 17:08, Xiao Ni wrote:
> > On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/06/14 15:57, Xiao Ni wrote:
> >>> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 2023/06/14 15:12, Xiao Ni wrote:
> >>>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 2023/06/14 9:48, Yu Kuai wrote:
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>>>>>>> idle_sync_thread, if sync_seq isn't equal
> >>>>>>>>
> >>>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>>>>>>> thread already, right? Why do
> >>>>>>>>
> >>>>>>>> you say 'new started sync thread' here?
> >>>>>>
> >>>>>> If someone stops the sync thread, and new sync thread is not started,
> >>>>>> then this sync_seq won't make a difference, above wait_event() will not
> >>>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> >>>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
> >>>>>> thread starts, add 'sync_seq' will bypass this case.
> >>>>>
> >>>>> Hi
> >>>>>
> >>>>> If a new sync thread starts, why can sync_seq be different? sync_seq
> >>>>> is only added in md_reap_sync_thread. And when a new sync request
> >>>>> starts, it can't stop the sync request again?
> >>>>>
> >>>>> Af first, the sync_seq is 0
> >>>>>
> >>>>> admin1
> >>>>> echo idle > sync_action
> >>>>> idle_sync_thread(sync_seq is 1)
> >>>>
> >>>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> >>>> mean that there is a sync_thread just finished?
> >>>
> >>> Hi Kuai
> >>>
> >>> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> >>> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> >>> patch right?
> >>
> >> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
> >> is set.
> >>
> >>>
> >>>>
> >>>> Then the problem is that idle_sync_thread() read sync_seq after the old
> >>>> sync_thread is done, and new sync_thread start before wait_event() is
> >>>> called, should we wait for this new sync_thread?
> >>>>
> >>>> My answer here is that we should, but I'm also ok to not wait this new
> >>>> sync_thread, I don't think this behaviour matters. The key point here
> >>>> is that once wait_event() is called from idle_sync_thread(), this
> >>>> wait_event() should not wait for new sync_thread...
> >>>
> >>> I think we should wait. If we don't wait for it, there is a problem.
> >>> One person echos idle to sync_action and it doesn't work sometimes.
> >>> It's a strange thing.
> >>>
> >>
> >> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
> >> for new sync_thread that is started after wait_event().
> >
> > I suggest removing this function. Without this change, it's simpler
> > and it works well without problems. The people that echo idle to
> > sync_action need to wait until the sync action finishes. The code
> > semantics are clear and simple.
> >>
> >>>>
> >>>>> echo resync > sync_action (new sync)
> >>>>
> >>>> If this is behind "echo idle > sync_action", idle_sync_thread should not
> >>>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
> >>>
> >>> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> >>> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
> >>
> >> This is not accurate: if `echo resync > sync_action` triggers a new
> >> sync_thread, then sync_seq is updated when this sync_thread is done.
> >> During this period, MD_RECOVERY_RUNNING is still set, so `echo idle >
> >> sync_action` will wait for the sync_thread to be done.
> >
> > I can understand your comment, but sorry, I still can't get how
> > sync_seq works. Could you give a specific case that explains how it
> > works?
>
> Ok, the problem is that echo idle is supposed to interrupt the
> sync_thread and stop it quickly. Now that we don't hold the mutex here,
> we can't prevent a new sync_thread from starting. For example:
>
> 1) a sync_thread A is running, MD_RECOVERY_RUNNING is set;
>
> 2) echo idle, A will be interrupted, the mutex is not held and
> idle_sync_thread() is waiting for MD_RECOVERY_RUNNING to be cleared.
>
> 3) A is interrupted, it'll clear MD_RECOVERY_RUNNING and try to wake up
> idle_sync_thread(), however, before idle_sync_thread() is woken, A can
> be done and a new sync_thread B can be started, and MD_RECOVERY_RUNNING
> will be set again.
>
> 4) idle_sync_thread() finally wakes up, however, MD_RECOVERY_RUNNING is
> set again and it will keep waiting. And this time B won't be interrupted.

Thanks for the example. I can understand the usage of it now. It's a
side effect of removing the mutex protection from idle_sync_thread.

There is a problem, though. A new sync thread is started in
md_check_recovery. After your patch, md_reap_sync_thread is called in
md_check_recovery too. So it looks like they can't happen at the same
time?

Regards
Xiao

>
> Thanks,
> Kuai
>
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Thu, Jun 15, 2023 at 4:01 PM Xiao Ni <xni@redhat.com> wrote:
>
> On Thu, Jun 15, 2023 at 9:29 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 2023/06/14 17:08, Xiao Ni wrote:
> > > On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> On 2023/06/14 15:57, Xiao Ni wrote:
> > >>> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> On 2023/06/14 15:12, Xiao Ni wrote:
> > >>>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> On 2023/06/14 9:48, Yu Kuai wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>>>
> > >>>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> > >>>>>>>> idle_sync_thread, if sync_seq isn't equal
> > >>>>>>>>
> > >>>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
> > >>>>>>>> thread already, right? Why do
> > >>>>>>>>
> > >>>>>>>> you say 'new started sync thread' here?
> > >>>>>>
> > >>>>>> If someone stops the sync thread, and new sync thread is not started,
> > >>>>>> then this sync_seq won't make a difference, above wait_event() will not
> > >>>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> > >>>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
> > >>>>>> thread starts, add 'sync_seq' will bypass this case.
> > >>>>>
> > >>>>> Hi
> > >>>>>
> > >>>>> If a new sync thread starts, why can sync_seq be different? sync_seq
> > >>>>> is only added in md_reap_sync_thread. And when a new sync request
> > >>>>> starts, it can't stop the sync request again?
> > >>>>>
> > >>>>> Af first, the sync_seq is 0
> > >>>>>
> > >>>>> admin1
> > >>>>> echo idle > sync_action
> > >>>>> idle_sync_thread(sync_seq is 1)
> > >>>>
> > >>>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> > >>>> mean that there is a sync_thread just finished?
> > >>>
> > >>> Hi Kuai
> > >>>
> > >>> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> > >>> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> > >>> patch right?
> > >>
> > >> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
> > >> is set.
> > >>
> > >>>
> > >>>>
> > >>>> Then the problem is that idle_sync_thread() read sync_seq after the old
> > >>>> sync_thread is done, and new sync_thread start before wait_event() is
> > >>>> called, should we wait for this new sync_thread?
> > >>>>
> > >>>> My answer here is that we should, but I'm also ok to not wait this new
> > >>>> sync_thread, I don't think this behaviour matters. The key point here
> > >>>> is that once wait_event() is called from idle_sync_thread(), this
> > >>>> wait_event() should not wait for new sync_thread...
> > >>>
> > >>> I think we should wait. If we don't wait for it, there is a problem.
> > >>> One person echos idle to sync_action and it doesn't work sometimes.
> > >>> It's a strange thing.
> > >>>
> > >>
> > >> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
> > >> for new sync_thread that is started after wait_event().
> > >
> > > I suggest removing this function. Without this change, it's simpler
> > > and it works well without problems. The people that echo idle to
> > > sync_action need to wait until the sync action finishes. The code
> > > semantics are clear and simple.
> > >>
> > >>>>
> > >>>>> echo resync > sync_action (new sync)
> > >>>>
> > >>>> If this is behind "echo idle > sync_action", idle_sync_thread should not
> > >>>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
> > >>>
> > >>> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> > >>> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
> > >>
> > >> This is not accurate: if `echo resync > sync_action` triggers a new
> > >> sync_thread, then sync_seq is updated when this sync_thread is done.
> > >> During this period, MD_RECOVERY_RUNNING is still set, so `echo idle >
> > >> sync_action` will wait for the sync_thread to be done.
> > >
> > > I can understand your comment, but sorry, I still can't get how
> > > sync_seq works. Could you give a specific case that explains how it
> > > works?
> >
> > Ok, the problem is that echo idle is supposed to interrupt the
> > sync_thread and stop it quickly. Now that we don't hold the mutex here,
> > we can't prevent a new sync_thread from starting. For example:
> >
> > 1) a sync_thread A is running, MD_RECOVERY_RUNNING is set;
> >
> > 2) echo idle, A will be interrupted, the mutex is not held and
> > idle_sync_thread() is waiting for MD_RECOVERY_RUNNING to be cleared.
> >
> > 3) A is interrupted, it'll clear MD_RECOVERY_RUNNING and try to wake up
> > idle_sync_thread(), however, before idle_sync_thread() is woken, A can
> > be done and a new sync_thread B can be started, and MD_RECOVERY_RUNNING
> > will be set again.
> >
> > 4) idle_sync_thread() finally wakes up, however, MD_RECOVERY_RUNNING is
> > set again and it will keep waiting. And this time B won't be interrupted.
>
> Thanks for the example. I can understand the usage of it now. It's a
> side effect of removing the mutex protection from idle_sync_thread.
>
> There is a problem, though. A new sync thread is started in
> md_check_recovery. After your patch, md_reap_sync_thread is called in
> md_check_recovery too. So it looks like they can't happen at the same
> time?

After thinking about it a while, there is still a race possibility.

md_reap_sync_thread is called in the pers daemon (e.g. raid10d ->
md_check_recovery) and md_check_recovery returns. Before
idle_sync_thread is woken, a new sync thread can be started in
md_check_recovery again.

But it's really strange when one person echoes idle to sync_action.
It's better to add some messages to notify the users that they need to
echo idle to sync_action again and retry. Is there a way that
md_reap_sync_thread can wait for idle_sync_thread?

Regards
Xiao
>
> Regards
> Xiao
>
> >
> > Thanks,
> > Kuai
> >
Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Yu Kuai 2 years, 8 months ago
Hi,

On 2023/06/15 16:17, Xiao Ni wrote:
>> Thanks for the example. I can understand the usage of it now. It's a
>> side effect of removing the mutex protection from idle_sync_thread.
>>
>> There is a problem, though. A new sync thread is started in
>> md_check_recovery. After your patch, md_reap_sync_thread is called in
>> md_check_recovery too. So it looks like they can't happen at the same
>> time?

Of course they can't. md_check_recovery() can only do one thing at a
time.
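
For context, the rough shape of md_check_recovery() as far as this
discussion goes (heavily simplified; sync_is_needed() below is just a
placeholder for the real checks, not an actual function):

void md_check_recovery(struct mddev *mddev)
{
        if (!mddev_trylock(mddev))
                return;

        if (mddev->sync_thread) {
                /* a sync_thread exists: reap it if it has finished */
                if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
                        md_reap_sync_thread(mddev);
                goto unlock;
        }

        /* no sync_thread: maybe start a new one */
        if (sync_is_needed(mddev))
                mddev->sync_thread = md_register_thread(md_do_sync,
                                                        mddev, "resync");
unlock:
        mddev_unlock(mddev);
}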

> 
> After thinking about it a while, there is still a race possibility.
>
> md_reap_sync_thread is called in the pers daemon (e.g. raid10d ->
> md_check_recovery) and md_check_recovery returns. Before
> idle_sync_thread is woken, a new sync thread can be started in
> md_check_recovery again.
>
> But it's really strange when one person echoes idle to sync_action.
> It's better to add some messages to notify the users that they need to
> echo idle to sync_action again and retry. Is there a way that
> md_reap_sync_thread can wait for idle_sync_thread?

I don't think this is a problem. echo idle only makes sure to interrupt
the current sync_thread; there is no guarantee that a sync_thread is not
running after "echo idle" is done, with or without this patchset. Before
this patchset, a new sync thread could still start after the mutex was
released.

Users should "echo frozen" instead of "echo idle" if they really want to
prevent a new sync_thread from starting.
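
The difference, roughly (a sketch of the idea, not the exact patch
code), is whether MD_RECOVERY_FROZEN is set before stopping the
sync_thread; md_check_recovery() will not start a new sync_thread while
that bit is set:

static void frozen_sync_thread(struct mddev *mddev)
{
        set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
        stop_sync_thread(mddev);        /* no new sync_thread can start */
}

/*
 * idle_sync_thread() clears MD_RECOVERY_FROZEN instead, so once the
 * current sync_thread is gone, md_check_recovery() is free to start a
 * new one.
 */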

Thanks,
Kuai
> 
> Regards
> Xiao
>>
>> Regards
>> Xiao
>>
>>>
>>> Thanks,
>>> Kuai
>>>
> 
> .
> 

Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock
Posted by Xiao Ni 2 years, 8 months ago
On Thu, Jun 15, 2023 at 5:05 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2023/06/15 16:17, Xiao Ni wrote:
> >> Thanks for the example. I can understand the usage of it now. It's a
> >> side effect of removing the mutex protection from idle_sync_thread.
> >>
> >> There is a problem, though. A new sync thread is started in
> >> md_check_recovery. After your patch, md_reap_sync_thread is called in
> >> md_check_recovery too. So it looks like they can't happen at the same
> >> time?
>
> Of course they can't. md_check_recovery() can only do one thing at a
> time.
>
> >
> > After thinking about it a while, there is still a race possibility.
> >
> > md_reap_sync_thread is called in the pers daemon (e.g. raid10d ->
> > md_check_recovery) and md_check_recovery returns. Before
> > idle_sync_thread is woken, a new sync thread can be started in
> > md_check_recovery again.
> >
> > But it's really strange when one person echoes idle to sync_action.
> > It's better to add some messages to notify the users that they need to
> > echo idle to sync_action again and retry. Is there a way that
> > md_reap_sync_thread can wait for idle_sync_thread?
>
> I don't think this is a problem. echo idle only makes sure to interrupt
> the current sync_thread; there is no guarantee that a sync_thread is not
> running after "echo idle" is done, with or without this patchset. Before
> this patchset, a new sync thread could still start after the mutex was
> released.
>
> Users should "echo frozen" instead of "echo idle" if they really want to
> prevent a new sync_thread from starting.

Thanks for all the explanations and patience.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Regards
> > Xiao
> >>
> >> Regards
> >> Xiao
> >>
> >>>
> >>> Thanks,
> >>> Kuai
> >>>
> >
> > .
> >
>