md/raid5: Fix a deadlock of reshape and suspend

[PATCH] md/raid5: Fix a deadlock of reshape and suspend

Posted by linan666@huaweicloud.com 2 months, 2 weeks ago

From: Li Nan <linan122@huawei.com>

Commit 868bba54a3bc ("md/raid5: fix a deadlock in the case that reshape is
interrupted") fixed a raid deadlock of reshape, but a similar issue is hit
by mdadm test 25raid456-reshape-deadlock.

  INFO: task (udev-worker):63822 blocked for more than 122 seconds.
        Not tainted 6.18.0-rc2-g0555b5424915-dirty #153
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  __schedule
  schedule
  schedule_timeout
  wait_woken
  raid5_make_request
  md_handle_request
  md_submit_bio
  [...]
  blkdev_read_iter
  vfs_read
  ksys_read
  __x64_sys_read

It is triggered by:
1) normal IO waits for reshape to progress
2) user sets ACTION_FROZEN via ioctl
3) reshape is interrupted and cannot restart
4) user tries to suspend the array while active IO waits for reshape

Following Kuai's previous fix, such IOs should fail in
make_stripe_request(). Thus, set a timeout for wait_woken() to fix
the deadlock, and blocked IO will fail in the next cycle.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 drivers/md/raid5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cdbc7eba5c54..957e712d2be9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6185,7 +6185,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			}
 
 			wait_woken(&wait, TASK_UNINTERRUPTIBLE,
-				   MAX_SCHEDULE_TIMEOUT);
+				   msecs_to_jiffies(10000));
 			continue;
 		}
 
-- 
2.39.2

Re: [PATCH] md/raid5: Fix a deadlock of reshape and suspend

Posted by Yu Kuai 1 month, 2 weeks ago

Hi,

On 2025/11/24 16:45, linan666@huaweicloud.com wrote:
> From: Li Nan <linan122@huawei.com>
>
> Commit 868bba54a3bc ("md/raid5: fix a deadlock in the case that reshape is
> interrupted") fixed a raid deadlock of reshape, but a similar issue is hit
> by mdadm test 25raid456-reshape-deadlock.
>
>    INFO: task (udev-worker):63822 blocked for more than 122 seconds.
>          Not tainted 6.18.0-rc2-g0555b5424915-dirty #153
>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>    __schedule
>    schedule
>    schedule_timeout
>    wait_woken
>    raid5_make_request
>    md_handle_request
>    md_submit_bio
>    [...]
>    blkdev_read_iter
>    vfs_read
>    ksys_read
>    __x64_sys_read
>
> It is triggered by:
> 1) normal IO waits for reshape to progress
> 2) user sets ACTION_FROZEN via ioctl
> 3) reshape is interrupted and cannot restart
> 4) user tries to suspend the array while active IO waits for reshape
>
> Following Kuai's previous fix, such IOs should fail in
> make_stripe_request(). Thus, set a timeout for wait_woken() to fix
> the deadlock, and blocked IO will fail in the next cycle.
>
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>   drivers/md/raid5.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cdbc7eba5c54..957e712d2be9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6185,7 +6185,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   			}
>   
>   			wait_woken(&wait, TASK_UNINTERRUPTIBLE,
> -				   MAX_SCHEDULE_TIMEOUT);
> +				   msecs_to_jiffies(10000));

Instead of waking up every 10s unconditionally, can you fix this by waking up
the waiting IO synchronously when the array is frozen or suspended and reshape
can't continue?

>   			continue;
>   		}
>   
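
For illustration only, here is a rough user-space sketch of the synchronous
wake-up idea suggested above. It uses pthreads rather than the md wait queue,
and every name in it (model_array, model_freeze, model_wait_for_reshape,
run_model) is hypothetical and does not exist in raid5.c. It only shows the
shape of the approach: the path that stops reshape wakes the waiters itself,
so the waiter needs no periodic timeout to notice that it should fail.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for array state; NOT the md/raid5 structures. */
struct model_array {
	pthread_mutex_t lock;
	pthread_cond_t wait_for_reshape;
	bool reshape_progressed;	/* the IO could now be handled */
	bool frozen;			/* reshape cannot continue */
};

struct model_io {
	struct model_array *array;
	int result;
};

/* Returns 0 if the IO may proceed, -1 if it should fail. No timeout. */
static int model_wait_for_reshape(struct model_array *a)
{
	int ret;

	pthread_mutex_lock(&a->lock);
	while (!a->reshape_progressed && !a->frozen)
		pthread_cond_wait(&a->wait_for_reshape, &a->lock);
	assert(a->reshape_progressed || a->frozen);
	ret = a->reshape_progressed ? 0 : -1;
	pthread_mutex_unlock(&a->lock);
	return ret;
}

/* The path that stops reshape wakes all waiters synchronously. */
static void model_freeze(struct model_array *a)
{
	pthread_mutex_lock(&a->lock);
	a->frozen = true;
	pthread_cond_broadcast(&a->wait_for_reshape);
	pthread_mutex_unlock(&a->lock);
}

static void *model_io_thread(void *arg)
{
	struct model_io *io = arg;

	io->result = model_wait_for_reshape(io->array);
	return NULL;
}

/* Start one blocked IO, freeze the array, and report the IO result. */
int run_model(void)
{
	struct model_array a = { .reshape_progressed = false, .frozen = false };
	struct model_io io = { .array = &a, .result = 0 };
	pthread_t tid;

	pthread_mutex_init(&a.lock, NULL);
	pthread_cond_init(&a.wait_for_reshape, NULL);
	if (pthread_create(&tid, NULL, model_io_thread, &io) != 0)
		return 1;
	model_freeze(&a);
	pthread_join(tid, NULL);
	pthread_cond_destroy(&a.wait_for_reshape);
	pthread_mutex_destroy(&a.lock);
	return io.result;
}
```

In this model run_model() returns -1: the blocked IO is released by the
freeze itself and fails on its next check, whether the freeze happens before
or after the IO thread starts waiting. How this maps onto the real wait queue
and the frozen/suspended checks in md is left open here.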

-- 
Thanks,
Kuai