[v1] md: fix is_mddev_idle()

[PATCH 3/4] md: fix is_mddev_idle()

Posted by Yu Kuai 10 months ago

From: Yu Kuai <yukuai3@huawei.com>

If sync_speed is above speed_min, then is_mddev_idle() will be called
for each sync IO to check if the array is idle, and inflight sync_io
will be limited if the array is not idle.

However, while running mkfs.ext4 on a large raid5 array with recovery in
progress, it was found that sync_speed is already above speed_min while
lots of stripes are used for sync IO, causing a long delay for mkfs.ext4.

The root cause is the following check in is_mddev_idle():

t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync IO
if (events2 - events1 > 64)

As a consequence, the more sync IO is issued, the less likely the check
is to pass. When completed normal IO exceeds issued sync IO, the
condition will finally pass and is_mddev_idle() will return false;
however, last_events is then updated, so is_mddev_idle() can only
return false once in a while.
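
To make the arithmetic above concrete, here is a minimal user-space sketch
of the old check. It is not kernel code: struct old_disk_model and
old_check_not_idle() are hypothetical names made up for illustration, and
the counters are plain integers rather than the real disk statistics.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the old per-disk accounting. */
struct old_disk_model {
	long completed_sectors;   /* all completed IO, sync and normal */
	long issued_sync_sectors; /* sync IO, counted at submission time */
	long last_events;         /* snapshot from the last time the check fired */
};

/*
 * Mirrors "curr_events - last_events > 64" from the removed code.
 * Returns true when the model would report the array as not idle.
 */
static bool old_check_not_idle(struct old_disk_model *d)
{
	long curr_events = d->completed_sectors - d->issued_sync_sectors;

	if (curr_events - d->last_events > 64) {
		d->last_events = curr_events;
		return true;
	}
	return false;
}
```

In this model, sync IO that has been issued but not yet completed pushes
curr_events down, so normal IO that completes in the meantime can stay
hidden until the outstanding sync IO drains.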

Fix this problem by changing the check so that the array is considered
idle only if:

1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disk is a partition, the other partitions on that disk
   don't have IO completed.
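
As a rough illustration of conditions 1) and 2), the following user-space
sketch models the array-level part of the new check. struct array_model and
array_is_idle() are hypothetical names; the real patch reads these values
from the array's gendisk statistics, and condition 3) is not modelled here.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the array-level counters. */
struct array_model {
	unsigned long completed_sectors; /* normal IO completed on the array */
	unsigned int inflight;           /* normal IO currently in flight */
	unsigned long last_events;       /* completed_sectors at the last check */
};

/* Idle only if nothing completed since the last check and nothing is inflight. */
static bool array_is_idle(struct array_model *a, bool init)
{
	unsigned long last = a->last_events;
	bool idle = true;

	a->last_events = a->completed_sectors;
	if (!init && (a->last_events > last || a->inflight > 0))
		idle = false;

	return idle;
}
```

Because the snapshot only tracks normal IO on the array itself, issuing more
sync IO no longer makes the check harder to trigger.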

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md.c | 78 ++++++++++++++++++++++++++-----------------------
 drivers/md/md.h |  3 +-
 2 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8966c4afc62a..19da93f8912c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8619,50 +8619,54 @@ void md_cluster_stop(struct mddev *mddev)
 	put_cluster_ops(mddev);
 }
 
-static int is_mddev_idle(struct mddev *mddev, int init)
+static bool is_rdev_idle(struct md_rdev *rdev, bool init)
+{
+	unsigned long last_events = rdev->last_events;
+
+	if (!bdev_is_partition(rdev->bdev))
+		return true;
+
+	rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
+						 sectors) -
+			    part_stat_read_accum(rdev->bdev, sectors);
+
+	if (!init && rdev->last_events > last_events)
+		return false;
+
+	return true;
+}
+
+/*
+ * mddev is idle if following conditions are match since last check:
+ * 1) mddev doesn't have normal IO completed;
+ * 2) mddev doesn't have inflight normal IO;
+ * 3) if any member disk is partition, and other partitions doesn't have IO
+ *    completed;
+ *
+ * Noted this checking rely on IO accounting is enabled.
+ */
+static bool is_mddev_idle(struct mddev *mddev, int init)
 {
 	struct md_rdev *rdev;
-	int idle;
-	int curr_events;
+	bool idle = true;
 
-	idle = 1;
-	rcu_read_lock();
-	rdev_for_each_rcu(rdev, mddev) {
-		struct gendisk *disk = rdev->bdev->bd_disk;
+	if (!mddev_is_dm(mddev)) {
+		unsigned long last_events = mddev->last_events;
 
-		if (!init && !blk_queue_io_stat(disk->queue))
-			continue;
+		mddev->last_events = part_stat_read_accum(mddev->gendisk->part0,
+							  sectors);
 
-		curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
-			      atomic_read(&disk->sync_io);
-		/* sync IO will cause sync_io to increase before the disk_stats
-		 * as sync_io is counted when a request starts, and
-		 * disk_stats is counted when it completes.
-		 * So resync activity will cause curr_events to be smaller than
-		 * when there was no such activity.
-		 * non-sync IO will cause disk_stat to increase without
-		 * increasing sync_io so curr_events will (eventually)
-		 * be larger than it was before.  Once it becomes
-		 * substantially larger, the test below will cause
-		 * the array to appear non-idle, and resync will slow
-		 * down.
-		 * If there is a lot of outstanding resync activity when
-		 * we set last_event to curr_events, then all that activity
-		 * completing might cause the array to appear non-idle
-		 * and resync will be slowed down even though there might
-		 * not have been non-resync activity.  This will only
-		 * happen once though.  'last_events' will soon reflect
-		 * the state where there is little or no outstanding
-		 * resync requests, and further resync activity will
-		 * always make curr_events less than last_events.
-		 *
-		 */
-		if (init || curr_events - rdev->last_events > 64) {
-			rdev->last_events = curr_events;
-			idle = 0;
-		}
+		if (!init && (mddev->last_events > last_events ||
+			      part_in_flight(mddev->gendisk->part0)))
+			idle = false;
 	}
+
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev)
+		if (!is_rdev_idle(rdev, init))
+			idle = false;
 	rcu_read_unlock();
+
 	return idle;
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 63be622467c6..95cf11c4abc6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -132,7 +132,7 @@ struct md_rdev {
 
 	sector_t sectors;		/* Device size (in 512bytes sectors) */
 	struct mddev *mddev;		/* RAID array if running */
-	int last_events;		/* IO event timestamp */
+	unsigned long last_events;	/* IO event timestamp */
 
 	/*
 	 * If meta_bdev is non-NULL, it means that a separate device is
@@ -519,6 +519,7 @@ struct mddev {
 							 * adding a spare
 							 */
 
+	unsigned long			last_events;	/* IO event timestamp */
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
 	sector_t			recovery_cp;
-- 
2.39.2

Re: [PATCH 3/4] md: fix is_mddev_idle()

Posted by Xiao Ni 9 months, 4 weeks ago

On 2025/4/12 3:32 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> If sync_speed is above speed_min, then is_mddev_idle() will be called
> for each sync IO to check if the array is idle, and inflight sync_io
> will be limited if the array is not idle.
>
> However, while running mkfs.ext4 on a large raid5 array with recovery in
> progress, it was found that sync_speed is already above speed_min while
> lots of stripes are used for sync IO, causing a long delay for mkfs.ext4.
>
> The root cause is the following check in is_mddev_idle():
>
> t1: submit sync IO: events1 = completed IO - issued sync IO
> t2: submit next sync IO: events2 = completed IO - issued sync IO
> if (events2 - events1 > 64)
>
> As a consequence, the more sync IO is issued, the less likely the check
> is to pass. When completed normal IO exceeds issued sync IO, the
> condition will finally pass and is_mddev_idle() will return false;
> however, last_events is then updated, so is_mddev_idle() can only
> return false once in a while.
>
> Fix this problem by changing the check so that the array is considered
> idle only if:
>
> 1) mddev doesn't have normal IO completed;
> 2) mddev doesn't have normal IO inflight;
> 3) if any member disk is a partition, the other partitions on that disk
>    don't have IO completed.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>   drivers/md/md.c | 78 ++++++++++++++++++++++++++-----------------------
>   drivers/md/md.h |  3 +-
>   2 files changed, 43 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 8966c4afc62a..19da93f8912c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8619,50 +8619,54 @@ void md_cluster_stop(struct mddev *mddev)
>   	put_cluster_ops(mddev);
>   }
>   
> -static int is_mddev_idle(struct mddev *mddev, int init)
> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
> +{
> +	unsigned long last_events = rdev->last_events;
> +
> +	if (!bdev_is_partition(rdev->bdev))
> +		return true;


For an md array, I think is_rdev_idle() is not useful, because
mddev->last_events must increase when upper-layer IO comes in, and idle
will be set to false. For a dm array, mddev->last_events can't work, so
is_rdev_idle() is for dm arrays. If the member disk is one partition,
is_rdev_idle() always returns true, and is_mddev_idle() always returns
true. It's a bug here. Do we need to check bdev_is_partition() here?

Best Regards

Xiao

> +
> +	rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
> +						 sectors) -
> +			    part_stat_read_accum(rdev->bdev, sectors);
> +
> +	if (!init && rdev->last_events > last_events)
> +		return false;
> +
> +	return true;
> +}
> +
> +/*
> + * mddev is idle if following conditions are match since last check:
> + * 1) mddev doesn't have normal IO completed;
> + * 2) mddev doesn't have inflight normal IO;
> + * 3) if any member disk is partition, and other partitions doesn't have IO
> + *    completed;
> + *
> + * Noted this checking rely on IO accounting is enabled.
> + */
> +static bool is_mddev_idle(struct mddev *mddev, int init)
>   {
>   	struct md_rdev *rdev;
> -	int idle;
> -	int curr_events;
> +	bool idle = true;
>   
> -	idle = 1;
> -	rcu_read_lock();
> -	rdev_for_each_rcu(rdev, mddev) {
> -		struct gendisk *disk = rdev->bdev->bd_disk;
> +	if (!mddev_is_dm(mddev)) {
> +		unsigned long last_events = mddev->last_events;
>   
> -		if (!init && !blk_queue_io_stat(disk->queue))
> -			continue;
> +		mddev->last_events = part_stat_read_accum(mddev->gendisk->part0,
> +							  sectors);
>   
> -		curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
> -			      atomic_read(&disk->sync_io);
> -		/* sync IO will cause sync_io to increase before the disk_stats
> -		 * as sync_io is counted when a request starts, and
> -		 * disk_stats is counted when it completes.
> -		 * So resync activity will cause curr_events to be smaller than
> -		 * when there was no such activity.
> -		 * non-sync IO will cause disk_stat to increase without
> -		 * increasing sync_io so curr_events will (eventually)
> -		 * be larger than it was before.  Once it becomes
> -		 * substantially larger, the test below will cause
> -		 * the array to appear non-idle, and resync will slow
> -		 * down.
> -		 * If there is a lot of outstanding resync activity when
> -		 * we set last_event to curr_events, then all that activity
> -		 * completing might cause the array to appear non-idle
> -		 * and resync will be slowed down even though there might
> -		 * not have been non-resync activity.  This will only
> -		 * happen once though.  'last_events' will soon reflect
> -		 * the state where there is little or no outstanding
> -		 * resync requests, and further resync activity will
> -		 * always make curr_events less than last_events.
> -		 *
> -		 */
> -		if (init || curr_events - rdev->last_events > 64) {
> -			rdev->last_events = curr_events;
> -			idle = 0;
> -		}
> +		if (!init && (mddev->last_events > last_events ||
> +			      part_in_flight(mddev->gendisk->part0)))
> +			idle = false;
>   	}
> +
> +	rcu_read_lock();
> +	rdev_for_each_rcu(rdev, mddev)
> +		if (!is_rdev_idle(rdev, init))
> +			idle = false;
>   	rcu_read_unlock();
> +
>   	return idle;
>   }
>   
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 63be622467c6..95cf11c4abc6 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -132,7 +132,7 @@ struct md_rdev {
>   
>   	sector_t sectors;		/* Device size (in 512bytes sectors) */
>   	struct mddev *mddev;		/* RAID array if running */
> -	int last_events;		/* IO event timestamp */
> +	unsigned long last_events;	/* IO event timestamp */
>   
>   	/*
>   	 * If meta_bdev is non-NULL, it means that a separate device is
> @@ -519,6 +519,7 @@ struct mddev {
>   							 * adding a spare
>   							 */
>   
> +	unsigned long			last_events;	/* IO event timestamp */
>   	atomic_t			recovery_active; /* blocks scheduled, but not written */
>   	wait_queue_head_t		recovery_wait;
>   	sector_t			recovery_cp;

Re: [PATCH 3/4] md: fix is_mddev_idle()

Posted by Yu Kuai 9 months, 4 weeks ago

Hi,

On 2025/04/16 14:20, Xiao Ni wrote:
>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
>> +{
>> +    unsigned long last_events = rdev->last_events;
>> +
>> +    if (!bdev_is_partition(rdev->bdev))
>> +        return true;
> 
> 
> For an md array, I think is_rdev_idle() is not useful, because
> mddev->last_events must increase when upper-layer IO comes in, and idle
> will be set to false. For a dm array, mddev->last_events can't work, so
> is_rdev_idle() is for dm arrays. If the member disk is one partition,
> is_rdev_idle() always returns true, and is_mddev_idle() always returns
> true. It's a bug here. Do we need to check bdev_is_partition() here?

is_rdev_idle() is not meant to check IO on the current array. For example:

sda1 is used for array md0, and the user doesn't issue IO to md0 but
does issue IO to sda2. In this case, is_mddev_idle() still fails for
array md0 because is_rdev_idle() fails.

This is just inherited from the old behaviour.
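
The sda1/sda2 case can be sketched in user space as follows. This is only a
model of the idea, not the kernel implementation: struct member_model and
member_holder_is_idle() are hypothetical names, and the sector counters
stand in for the whole-disk and per-partition statistics that the patch
reads.

```c
#include <stdbool.h>

/* Hypothetical user-space model of one member device of the array. */
struct member_model {
	bool is_partition;          /* false when the member is a whole disk */
	unsigned long disk_sectors; /* IO completed on the whole disk */
	unsigned long part_sectors; /* IO completed on the member partition */
	unsigned long last_events;  /* IO from other partitions at last check */
};

/* Not idle if other partitions on the same disk completed IO since last check. */
static bool member_holder_is_idle(struct member_model *m, bool init)
{
	unsigned long last = m->last_events;

	if (!m->is_partition)
		return true;

	/* Whole-disk IO minus this partition's IO leaves the other partitions. */
	m->last_events = m->disk_sectors - m->part_sectors;

	if (!init && m->last_events > last)
		return false;

	return true;
}
```

With sda1 as the member, IO issued through md0 raises both counters equally
and leaves the difference unchanged, while IO to sda2 raises only the
whole-disk counter and makes the member report not idle once.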

Thanks,
Kuai

> 
> Best Regards
> 
> Xiao

Re: [PATCH 3/4] md: fix is_mddev_idle()

Posted by Yu Kuai 9 months, 4 weeks ago

Hi,

On 2025/04/16 15:42, Yu Kuai wrote:
> Hi,
> 
> On 2025/04/16 14:20, Xiao Ni wrote:
>>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
>>> +{
>>> +    unsigned long last_events = rdev->last_events;
>>> +
>>> +    if (!bdev_is_partition(rdev->bdev))
>>> +        return true;
>>
>>
>> For an md array, I think is_rdev_idle() is not useful, because
>> mddev->last_events must increase when upper-layer IO comes in, and idle
>> will be set to false. For a dm array, mddev->last_events can't work, so
>> is_rdev_idle() is for dm arrays. If the member disk is one partition,
>> is_rdev_idle() always returns true, and is_mddev_idle() always returns
>> true. It's a bug here. Do we need to check bdev_is_partition() here?
> 
> is_rdev_idle() is not meant to check IO on the current array. For example:
> 
> sda1 is used for array md0, and the user doesn't issue IO to md0 but
> does issue IO to sda2. In this case, is_mddev_idle() still fails for
> array md0 because is_rdev_idle() fails.

Perhaps the name is_rdev_holder_idle() is better.

Thanks,
Kuai

> 
> This is just inherited from the old behaviour.
> 
> Thanks,
> Kuai
> 
>>
>> Best Regards
>>
>> Xiao
> 
> .
>

Re: [PATCH 3/4] md: fix is_mddev_idle()

Posted by Xiao Ni 9 months, 4 weeks ago

On Wed, Apr 16, 2025 at 5:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2025/04/16 15:42, Yu Kuai wrote:
> > Hi,
> >
> > On 2025/04/16 14:20, Xiao Ni wrote:
> >>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
> >>> +{
> >>> +    unsigned long last_events = rdev->last_events;
> >>> +
> >>> +    if (!bdev_is_partition(rdev->bdev))
> >>> +        return true;
> >>
> >>
> >> For an md array, I think is_rdev_idle() is not useful, because
> >> mddev->last_events must increase when upper-layer IO comes in, and idle
> >> will be set to false. For a dm array, mddev->last_events can't work, so
> >> is_rdev_idle() is for dm arrays. If the member disk is one partition,
> >> is_rdev_idle() always returns true, and is_mddev_idle() always returns
> >> true. It's a bug here. Do we need to check bdev_is_partition() here?
> >
> > is_rdev_idle() is not meant to check IO on the current array. For example:
> >
> > sda1 is used for array md0, and the user doesn't issue IO to md0 but
> > does issue IO to sda2. In this case, is_mddev_idle() still fails for
> > array md0 because is_rdev_idle() fails.

Thanks very much for the explanation. It makes sense :)

>
> Perhaps the name is_rdev_holder_idle() is better.

Your suggestion is better. And it's better to add some comments before
this function.

But how about dm-raid? Can this patch work for dm-raid?

Regards
Xiao

>
> Thanks,
> Kuai
>
> >
> > This is just inherited from the old behaviour.
> >
> > Thanks,
> > Kuai
> >
> >>
> >> Best Regards
> >>
> >> Xiao
> >
> > .
> >
>

Re: [PATCH 3/4] md: fix is_mddev_idle()

Posted by Yu Kuai 9 months, 4 weeks ago

Hi,

On 2025/04/16 17:44, Xiao Ni wrote:
> On Wed, Apr 16, 2025 at 5:29 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2025/04/16 15:42, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/04/16 14:20, Xiao Ni wrote:
>>>>> +static bool is_rdev_idle(struct md_rdev *rdev, bool init)
>>>>> +{
>>>>> +    unsigned long last_events = rdev->last_events;
>>>>> +
>>>>> +    if (!bdev_is_partition(rdev->bdev))
>>>>> +        return true;
>>>>
>>>>
>>>> For an md array, I think is_rdev_idle() is not useful, because
>>>> mddev->last_events must increase when upper-layer IO comes in, and idle
>>>> will be set to false. For a dm array, mddev->last_events can't work, so
>>>> is_rdev_idle() is for dm arrays. If the member disk is one partition,
>>>> is_rdev_idle() always returns true, and is_mddev_idle() always returns
>>>> true. It's a bug here. Do we need to check bdev_is_partition() here?
>>>
>>> is_rdev_idle() is not meant to check IO on the current array. For example:
>>>
>>> sda1 is used for array md0, and the user doesn't issue IO to md0 but
>>> does issue IO to sda2. In this case, is_mddev_idle() still fails for
>>> array md0 because is_rdev_idle() fails.
> 
> Thanks very much for the explanation. It makes sense :)
> 
>>
>> Perhaps the name is_rdev_holder_idle() is better.
> 
> Your suggestion is better. And it's better to add some comments before
> this function.
> 
> But how about dm-raid? Can this patch work for dm-raid?

is_rdev_holder_idle() can work for dm-raid; however, the part that
checks whether normal IO is inflight or completed can't work for dm-raid,
because currently there is no way to grab the dm gendisk from mddev.
However, I think there won't be a regression, since the old buggy
is_mddev_idle() almost always returns false.
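
A small user-space sketch of that limitation, under simplifying assumptions:
struct dm_aware_array and model_is_idle() are hypothetical names, the init
pass and the inflight check are left out, and members_idle stands in for the
combined result of the per-member holder checks.

```c
#include <stdbool.h>

/* Hypothetical user-space model of an array that may sit under dm-raid. */
struct dm_aware_array {
	bool is_dm;                      /* true when managed by dm-raid */
	unsigned long completed_sectors; /* normal IO completed on the array */
	unsigned long last_events;       /* completed_sectors at the last check */
};

/* The array-level check is skipped for dm; only member results are used. */
static bool model_is_idle(struct dm_aware_array *a, bool members_idle)
{
	bool idle = members_idle;

	if (!a->is_dm) {
		unsigned long last = a->last_events;

		a->last_events = a->completed_sectors;
		if (a->last_events > last)
			idle = false;
	}

	return idle;
}
```

In this model, normal IO completed on a dm-raid array goes unnoticed by the
array-level check, so only activity seen by the member checks can mark it as
not idle.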

Thanks,
Kuai

> 
> Regards
> Xiao
> 
>>
>> Thanks,
>> Kuai
>>
>>>
>>> This is just inherited from the old behaviour.
>>>
>>> Thanks,
>>> Kuai
>>>
>>>>
>>>> Best Regards
>>>>
>>>> Xiao
>>>
>>> .
>>>
>>
> 
> 
> .
>