[v1] md/raid1: refactor read_balance() and some minor fix

[PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Yu Kuai 1 year, 11 months ago

From: Yu Kuai <yukuai3@huawei.com>

Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
the case choose next idle in read_balance():

read_balance:
 for_each_rdev
  if(next_seq_sect == this_sector || disk == 0)
  -> sequential reads
   best_disk = disk;
   if (...)
    choose_next_idle = 1
    continue;

 for_each_rdev
 -> iterate next rdev
  if (pending == 0)
   best_disk = disk;
   -> choose the next idle disk
   break;

  if (choose_next_idle)
   -> keep using this rdev if there are no other idle disk
   contine

However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
remove the code:

-               /* If device is idle, use it */
-               if (pending == 0) {
-                       best_disk = disk;
-                       break;
-               }

Hence choose next idle will never work now, fix this problem by
following:

1) don't set best_disk in this case, read_balance() will choose the best
   disk after iterating all the disks;
2) add 'pending' so that other idle disk will be chosen;
3) set 'dist' to 0 so that if there is no other idle disk, and all disks
   are rotational, this disk will still be chosen;

Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid1.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c60ea58ae8c5..d0bc67e6d068 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	unsigned int min_pending;
 	struct md_rdev *rdev;
 	int choose_first;
-	int choose_next_idle;
 
 	/*
 	 * Check if we can balance. We can balance on the whole
@@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	best_pending_disk = -1;
 	min_pending = UINT_MAX;
 	best_good_sectors = 0;
-	choose_next_idle = 0;
 	clear_bit(R1BIO_FailFast, &r1_bio->state);
 
 	if ((conf->mddev->recovery_cp < this_sector + sectors) ||
@@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 			int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
 			struct raid1_info *mirror = &conf->mirrors[disk];
 
-			best_disk = disk;
 			/*
 			 * If buffered sequential IO size exceeds optimal
 			 * iosize, check if there is idle disk. If yes, choose
@@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 			    mirror->next_seq_sect > opt_iosize &&
 			    mirror->next_seq_sect - opt_iosize >=
 			    mirror->seq_start) {
-				choose_next_idle = 1;
-				continue;
+				/*
+				 * Add 'pending' to avoid choosing this disk if
+				 * there is other idle disk.
+				 * Set 'dist' to 0, so that if there is no other
+				 * idle disk and all disks are rotational, this
+				 * disk will still be chosen.
+				 */
+				pending++;
+				dist = 0;
+			} else {
+				best_disk = disk;
+				break;
 			}
-			break;
 		}
 
-		if (choose_next_idle)
-			continue;
-
 		if (min_pending > pending) {
 			min_pending = pending;
 			best_pending_disk = disk;
-- 
2.39.2

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Xiao Ni 1 year, 11 months ago

On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
>  for_each_rdev
>   if(next_seq_sect == this_sector || disk == 0)
>   -> sequential reads
>    best_disk = disk;
>    if (...)
>     choose_next_idle = 1
>     continue;
>
>  for_each_rdev
>  -> iterate next rdev
>   if (pending == 0)
>    best_disk = disk;
>    -> choose the next idle disk
>    break;
>
>   if (choose_next_idle)
>    -> keep using this rdev if there are no other idle disk
>    contine
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> -               /* If device is idle, use it */
> -               if (pending == 0) {
> -                       best_disk = disk;
> -                       break;
> -               }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
>    disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>    are rotational, this disk will still be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/raid1.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c60ea58ae8c5..d0bc67e6d068 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
> -       int choose_next_idle;
>
>         /*
>          * Check if we can balance. We can balance on the whole
> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         best_pending_disk = -1;
>         min_pending = UINT_MAX;
>         best_good_sectors = 0;
> -       choose_next_idle = 0;
>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>
>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>                         struct raid1_info *mirror = &conf->mirrors[disk];
>
> -                       best_disk = disk;
>                         /*
>                          * If buffered sequential IO size exceeds optimal
>                          * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                             mirror->next_seq_sect > opt_iosize &&
>                             mirror->next_seq_sect - opt_iosize >=
>                             mirror->seq_start) {
> -                               choose_next_idle = 1;
> -                               continue;
> +                               /*
> +                                * Add 'pending' to avoid choosing this disk if
> +                                * there is other idle disk.
> +                                * Set 'dist' to 0, so that if there is no other
> +                                * idle disk and all disks are rotational, this
> +                                * disk will still be chosen.
> +                                */
> +                               pending++;
> +                               dist = 0;
> +                       } else {
> +                               best_disk = disk;
> +                               break;
>                         }
> -                       break;
>                 }

Hi Kuai

I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
a sequential read. If there are no other idle disks, it will read from
the sequential disk. With this patch, it reads from the
best_pending_disk even min_pending is not 0. It looks like a wrong
behaviour?

Best Regards
Xiao
>
> -               if (choose_next_idle)
> -                       continue;
> -
>                 if (min_pending > pending) {
>                         min_pending = pending;
>                         best_pending_disk = disk;
> --
> 2.39.2
>
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Yu Kuai 1 year, 11 months ago

Hi,

在 2024/02/27 10:23, Xiao Ni 写道:
> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>> the case choose next idle in read_balance():
>>
>> read_balance:
>>   for_each_rdev
>>    if(next_seq_sect == this_sector || disk == 0)
>>    -> sequential reads
>>     best_disk = disk;
>>     if (...)
>>      choose_next_idle = 1
>>      continue;
>>
>>   for_each_rdev
>>   -> iterate next rdev
>>    if (pending == 0)
>>     best_disk = disk;
>>     -> choose the next idle disk
>>     break;
>>
>>    if (choose_next_idle)
>>     -> keep using this rdev if there are no other idle disk
>>     contine
>>
>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> remove the code:
>>
>> -               /* If device is idle, use it */
>> -               if (pending == 0) {
>> -                       best_disk = disk;
>> -                       break;
>> -               }
>>
>> Hence choose next idle will never work now, fix this problem by
>> following:
>>
>> 1) don't set best_disk in this case, read_balance() will choose the best
>>     disk after iterating all the disks;
>> 2) add 'pending' so that other idle disk will be chosen;
>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>     are rotational, this disk will still be chosen;
>>
>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/raid1.c | 21 ++++++++++++---------
>>   1 file changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index c60ea58ae8c5..d0bc67e6d068 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>          unsigned int min_pending;
>>          struct md_rdev *rdev;
>>          int choose_first;
>> -       int choose_next_idle;
>>
>>          /*
>>           * Check if we can balance. We can balance on the whole
>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>          best_pending_disk = -1;
>>          min_pending = UINT_MAX;
>>          best_good_sectors = 0;
>> -       choose_next_idle = 0;
>>          clear_bit(R1BIO_FailFast, &r1_bio->state);
>>
>>          if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>                          int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>                          struct raid1_info *mirror = &conf->mirrors[disk];
>>
>> -                       best_disk = disk;
>>                          /*
>>                           * If buffered sequential IO size exceeds optimal
>>                           * iosize, check if there is idle disk. If yes, choose
>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>                              mirror->next_seq_sect > opt_iosize &&
>>                              mirror->next_seq_sect - opt_iosize >=
>>                              mirror->seq_start) {
>> -                               choose_next_idle = 1;
>> -                               continue;
>> +                               /*
>> +                                * Add 'pending' to avoid choosing this disk if
>> +                                * there is other idle disk.
>> +                                * Set 'dist' to 0, so that if there is no other
>> +                                * idle disk and all disks are rotational, this
>> +                                * disk will still be chosen.
>> +                                */
>> +                               pending++;
>> +                               dist = 0;
>> +                       } else {
>> +                               best_disk = disk;
>> +                               break;
>>                          }
>> -                       break;
>>                  }
> 
> Hi Kuai
> 
> I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
> a sequential read. If there are no other idle disks, it will read from
> the sequential disk. With this patch, it reads from the
> best_pending_disk even min_pending is not 0. It looks like a wrong
> behaviour?

Yes, nice catch, I didn't notice this yet... So there is a hidden
logical, sequential IO priority is higher than minimal 'pending'
selection, it's only less than 'choose_next_idle' where idle disk
exist.

Looks like if we want to keep this behaviour, we can add a 'sequential
disk':

if (is_sequential())
  if (!should_choose_next())
   return disk;
  ctl.sequential_disk = disk;

...

if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
  return ctl.sequential_disk;

Thanks,
Kuai

> 
> Best Regards
> Xiao
>>
>> -               if (choose_next_idle)
>> -                       continue;
>> -
>>                  if (min_pending > pending) {
>>                          min_pending = pending;
>>                          best_pending_disk = disk;
>> --
>> 2.39.2
>>
>>
> 
> .
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Xiao Ni 1 year, 11 months ago

On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/02/27 10:23, Xiao Ni 写道:
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >> the case choose next idle in read_balance():
> >>
> >> read_balance:
> >>   for_each_rdev
> >>    if(next_seq_sect == this_sector || disk == 0)
> >>    -> sequential reads
> >>     best_disk = disk;
> >>     if (...)
> >>      choose_next_idle = 1
> >>      continue;
> >>
> >>   for_each_rdev
> >>   -> iterate next rdev
> >>    if (pending == 0)
> >>     best_disk = disk;
> >>     -> choose the next idle disk
> >>     break;
> >>
> >>    if (choose_next_idle)
> >>     -> keep using this rdev if there are no other idle disk
> >>     contine
> >>
> >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> remove the code:
> >>
> >> -               /* If device is idle, use it */
> >> -               if (pending == 0) {
> >> -                       best_disk = disk;
> >> -                       break;
> >> -               }
> >>
> >> Hence choose next idle will never work now, fix this problem by
> >> following:
> >>
> >> 1) don't set best_disk in this case, read_balance() will choose the best
> >>     disk after iterating all the disks;
> >> 2) add 'pending' so that other idle disk will be chosen;
> >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>     are rotational, this disk will still be chosen;
> >>
> >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >> ---
> >>   drivers/md/raid1.c | 21 ++++++++++++---------
> >>   1 file changed, 12 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c60ea58ae8c5..d0bc67e6d068 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          unsigned int min_pending;
> >>          struct md_rdev *rdev;
> >>          int choose_first;
> >> -       int choose_next_idle;
> >>
> >>          /*
> >>           * Check if we can balance. We can balance on the whole
> >> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          best_pending_disk = -1;
> >>          min_pending = UINT_MAX;
> >>          best_good_sectors = 0;
> >> -       choose_next_idle = 0;
> >>          clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>
> >>          if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                          int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >>                          struct raid1_info *mirror = &conf->mirrors[disk];
> >>
> >> -                       best_disk = disk;
> >>                          /*
> >>                           * If buffered sequential IO size exceeds optimal
> >>                           * iosize, check if there is idle disk. If yes, choose
> >> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                              mirror->next_seq_sect > opt_iosize &&
> >>                              mirror->next_seq_sect - opt_iosize >=
> >>                              mirror->seq_start) {
> >> -                               choose_next_idle = 1;
> >> -                               continue;
> >> +                               /*
> >> +                                * Add 'pending' to avoid choosing this disk if
> >> +                                * there is other idle disk.
> >> +                                * Set 'dist' to 0, so that if there is no other
> >> +                                * idle disk and all disks are rotational, this
> >> +                                * disk will still be chosen.
> >> +                                */
> >> +                               pending++;
> >> +                               dist = 0;
> >> +                       } else {
> >> +                               best_disk = disk;
> >> +                               break;
> >>                          }
> >> -                       break;
> >>                  }
> >
> > Hi Kuai
> >
> > I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
> > a sequential read. If there are no other idle disks, it will read from
> > the sequential disk. With this patch, it reads from the
> > best_pending_disk even min_pending is not 0. It looks like a wrong
> > behaviour?
>
> Yes, nice catch, I didn't notice this yet... So there is a hidden
> logical, sequential IO priority is higher than minimal 'pending'
> selection, it's only less than 'choose_next_idle' where idle disk
> exist.

Yes.


>
> Looks like if we want to keep this behaviour, we can add a 'sequential
> disk':
>
> if (is_sequential())
>   if (!should_choose_next())
>    return disk;
>   ctl.sequential_disk = disk;
>
> ...
>
> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>   return ctl.sequential_disk;

Agree with this, thanks :)

Best Regards
Xiao
>
> Thanks,
> Kuai
>
> >
> > Best Regards
> > Xiao
> >>
> >> -               if (choose_next_idle)
> >> -                       continue;
> >> -
> >>                  if (min_pending > pending) {
> >>                          min_pending = pending;
> >>                          best_pending_disk = disk;
> >> --
> >> 2.39.2
> >>
> >>
> >
> > .
> >
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Luse, Paul E 1 year, 11 months ago


> On Feb 26, 2024, at 9:49 PM, Xiao Ni <xni@redhat.com> wrote:
> 
> On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>> 
>> Hi,
>> 
>> 在 2024/02/27 10:23, Xiao Ni 写道:
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>> 
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>> 
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>>>> the case choose next idle in read_balance():
>>>> 
>>>> read_balance:
>>>>  for_each_rdev
>>>>   if(next_seq_sect == this_sector || disk == 0)
>>>>   -> sequential reads
>>>>    best_disk = disk;
>>>>    if (...)
>>>>     choose_next_idle = 1
>>>>     continue;
>>>> 
>>>>  for_each_rdev
>>>>  -> iterate next rdev
>>>>   if (pending == 0)
>>>>    best_disk = disk;
>>>>    -> choose the next idle disk
>>>>    break;
>>>> 
>>>>   if (choose_next_idle)
>>>>    -> keep using this rdev if there are no other idle disk
>>>>    contine
>>>> 
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> remove the code:
>>>> 
>>>> -               /* If device is idle, use it */
>>>> -               if (pending == 0) {
>>>> -                       best_disk = disk;
>>>> -                       break;
>>>> -               }
>>>> 
>>>> Hence choose next idle will never work now, fix this problem by
>>>> following:
>>>> 
>>>> 1) don't set best_disk in this case, read_balance() will choose the best
>>>>    disk after iterating all the disks;
>>>> 2) add 'pending' so that other idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>>>    are rotational, this disk will still be chosen;
>>>> 
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>>  drivers/md/raid1.c | 21 ++++++++++++---------
>>>>  1 file changed, 12 insertions(+), 9 deletions(-)
>>>> 
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>         unsigned int min_pending;
>>>>         struct md_rdev *rdev;
>>>>         int choose_first;
>>>> -       int choose_next_idle;
>>>> 
>>>>         /*
>>>>          * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>         best_pending_disk = -1;
>>>>         min_pending = UINT_MAX;
>>>>         best_good_sectors = 0;
>>>> -       choose_next_idle = 0;
>>>>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>> 
>>>>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>>                         struct raid1_info *mirror = &conf->mirrors[disk];
>>>> 
>>>> -                       best_disk = disk;
>>>>                         /*
>>>>                          * If buffered sequential IO size exceeds optimal
>>>>                          * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                             mirror->next_seq_sect > opt_iosize &&
>>>>                             mirror->next_seq_sect - opt_iosize >=
>>>>                             mirror->seq_start) {
>>>> -                               choose_next_idle = 1;
>>>> -                               continue;
>>>> +                               /*
>>>> +                                * Add 'pending' to avoid choosing this disk if
>>>> +                                * there is other idle disk.
>>>> +                                * Set 'dist' to 0, so that if there is no other
>>>> +                                * idle disk and all disks are rotational, this
>>>> +                                * disk will still be chosen.
>>>> +                                */
>>>> +                               pending++;
>>>> +                               dist = 0;
>>>> +                       } else {
>>>> +                               best_disk = disk;
>>>> +                               break;
>>>>                         }
>>>> -                       break;
>>>>                 }
>>> 
>>> Hi Kuai
>>> 
>>> I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
>>> a sequential read. If there are no other idle disks, it will read from
>>> the sequential disk. With this patch, it reads from the
>>> best_pending_disk even min_pending is not 0. It looks like a wrong
>>> behaviour?
>> 
>> Yes, nice catch, I didn't notice this yet... So there is a hidden
>> logical, sequential IO priority is higher than minimal 'pending'
>> selection, it's only less than 'choose_next_idle' where idle disk
>> exist.
> 
> Yes.
> 
> 
>> 
>> Looks like if we want to keep this behaviour, we can add a 'sequential
>> disk':
>> 
>> if (is_sequential())
>>  if (!should_choose_next())
>>   return disk;
>>  ctl.sequential_disk = disk;
>> 
>> ...
>> 
>> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>>  return ctl.sequential_disk;
> 
> Agree with this, thanks :)
> 
> Best Regards
> Xiao

Yup, agree as well.  This will help for sure with the followup to this series for seq read improvements :) 

>> 
>> Thanks,
>> Kuai
>> 
>>> 
>>> Best Regards
>>> Xiao
>>>> 
>>>> -               if (choose_next_idle)
>>>> -                       continue;
>>>> -
>>>>                 if (min_pending > pending) {
>>>>                         min_pending = pending;
>>>>                         best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>> 
>>>> 
>>> 
>>> .
>>> 
>> 
> 
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Xiao Ni 1 year, 11 months ago

Hi Kuai

Thanks for the effort!

On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> the case choose next idle in read_balance():
>
> read_balance:
>  for_each_rdev
>   if(next_seq_sect == this_sector || disk == 0)

typo error: s/disk/dist/g

>   -> sequential reads
>    best_disk = disk;
>    if (...)
>     choose_next_idle = 1
>     continue;
>
>  for_each_rdev
>  -> iterate next rdev
>   if (pending == 0)
>    best_disk = disk;
>    -> choose the next idle disk
>    break;
>
>   if (choose_next_idle)
>    -> keep using this rdev if there are no other idle disk
>    continue
>
> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> remove the code:
>
> -               /* If device is idle, use it */
> -               if (pending == 0) {
> -                       best_disk = disk;
> -                       break;
> -               }
>
> Hence choose next idle will never work now, fix this problem by
> following:
>
> 1) don't set best_disk in this case, read_balance() will choose the best
>    disk after iterating all the disks;
> 2) add 'pending' so that other idle disk will be chosen;
> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>    are rotational, this disk will still be chosen;
>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/raid1.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c60ea58ae8c5..d0bc67e6d068 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         unsigned int min_pending;
>         struct md_rdev *rdev;
>         int choose_first;
> -       int choose_next_idle;
>
>         /*
>          * Check if we can balance. We can balance on the whole
> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>         best_pending_disk = -1;
>         min_pending = UINT_MAX;
>         best_good_sectors = 0;
> -       choose_next_idle = 0;
>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>
>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>                         struct raid1_info *mirror = &conf->mirrors[disk];
>
> -                       best_disk = disk;
>                         /*
>                          * If buffered sequential IO size exceeds optimal
>                          * iosize, check if there is idle disk. If yes, choose
> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>                             mirror->next_seq_sect > opt_iosize &&
>                             mirror->next_seq_sect - opt_iosize >=
>                             mirror->seq_start) {
> -                               choose_next_idle = 1;
> -                               continue;
> +                               /*
> +                                * Add 'pending' to avoid choosing this disk if
> +                                * there is other idle disk.
> +                                * Set 'dist' to 0, so that if there is no other
> +                                * idle disk and all disks are rotational, this
> +                                * disk will still be chosen.
> +                                */
> +                               pending++;
> +                               dist = 0;

There is a problem. If all disks are not idle and there is a disk with
dist=0 before the seq disk, it can't read from the seq disk. It will
read from the first disk with dist=0. Maybe we can only add the codes
which are removed from 2e52d449bcec?

Best Regards
Xiao

> +                       } else {
> +                               best_disk = disk;
> +                               break;
>                         }
> -                       break;
>                 }
>
> -               if (choose_next_idle)
> -                       continue;
> -
>                 if (min_pending > pending) {
>                         min_pending = pending;
>                         best_pending_disk = disk;
> --
> 2.39.2
>
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Yu Kuai 1 year, 11 months ago

Hi,

在 2024/02/26 16:55, Xiao Ni 写道:
> Hi Kuai
> 
> Thanks for the effort!
> 
> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>> the case choose next idle in read_balance():
>>
>> read_balance:
>>   for_each_rdev
>>    if(next_seq_sect == this_sector || disk == 0)
> 
> typo error: s/disk/dist/g
> 
>>    -> sequential reads
>>     best_disk = disk;
>>     if (...)
>>      choose_next_idle = 1
>>      continue;
>>
>>   for_each_rdev
>>   -> iterate next rdev
>>    if (pending == 0)
>>     best_disk = disk;
>>     -> choose the next idle disk
>>     break;
>>
>>    if (choose_next_idle)
>>     -> keep using this rdev if there are no other idle disk
>>     continue
>>
>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> remove the code:
>>
>> -               /* If device is idle, use it */
>> -               if (pending == 0) {
>> -                       best_disk = disk;
>> -                       break;
>> -               }
>>
>> Hence choose next idle will never work now, fix this problem by
>> following:
>>
>> 1) don't set best_disk in this case, read_balance() will choose the best
>>     disk after iterating all the disks;
>> 2) add 'pending' so that other idle disk will be chosen;
>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>     are rotational, this disk will still be chosen;
>>
>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   drivers/md/raid1.c | 21 ++++++++++++---------
>>   1 file changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index c60ea58ae8c5..d0bc67e6d068 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>          unsigned int min_pending;
>>          struct md_rdev *rdev;
>>          int choose_first;
>> -       int choose_next_idle;
>>
>>          /*
>>           * Check if we can balance. We can balance on the whole
>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>          best_pending_disk = -1;
>>          min_pending = UINT_MAX;
>>          best_good_sectors = 0;
>> -       choose_next_idle = 0;
>>          clear_bit(R1BIO_FailFast, &r1_bio->state);
>>
>>          if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>                          int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>                          struct raid1_info *mirror = &conf->mirrors[disk];
>>
>> -                       best_disk = disk;
>>                          /*
>>                           * If buffered sequential IO size exceeds optimal
>>                           * iosize, check if there is idle disk. If yes, choose
>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>                              mirror->next_seq_sect > opt_iosize &&
>>                              mirror->next_seq_sect - opt_iosize >=
>>                              mirror->seq_start) {
>> -                               choose_next_idle = 1;
>> -                               continue;
>> +                               /*
>> +                                * Add 'pending' to avoid choosing this disk if
>> +                                * there is other idle disk.
>> +                                * Set 'dist' to 0, so that if there is no other
>> +                                * idle disk and all disks are rotational, this
>> +                                * disk will still be chosen.
>> +                                */
>> +                               pending++;
>> +                               dist = 0;
> 
> There is a problem. If all disks are not idle and there is a disk with
> dist=0 before the seq disk, it can't read from the seq disk. It will
> read from the first disk with dist=0. Maybe we can only add the codes
> which are removed from 2e52d449bcec?

If there is a disk with disk=0, then best_dist_disk will be updated to
the disk, and best_dist will be updated to 0 already:

// in each iteration
if (dist < best_dist) {
	best_dist = dist;
	btest_disk_disk = disk;
}

In this case, best_dist will be set to the first disk with dist=0, and
at last, the disk will be chosen:

if (best_disk == -1) {
         if (has_nonrot_disk || min_pending == 0)
                 best_disk = best_pending_disk;
         else
                 best_disk = best_dist_disk;
		-> the first disk with dist=0;
}

So, the problem that you concerned should not exist.

Thanks,
Kuai
> 
> Best Regards
> Xiao
> 
>> +                       } else {
>> +                               best_disk = disk;
>> +                               break;
>>                          }
>> -                       break;
>>                  }
>>
>> -               if (choose_next_idle)
>> -                       continue;
>> -
>>                  if (min_pending > pending) {
>>                          min_pending = pending;
>>                          best_pending_disk = disk;
>> --
>> 2.39.2
>>
>>
> 
> .
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Xiao Ni 1 year, 11 months ago

On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/02/26 16:55, Xiao Ni 写道:
> > Hi Kuai
> >
> > Thanks for the effort!
> >
> > On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >> the case choose next idle in read_balance():
> >>
> >> read_balance:
> >>   for_each_rdev
> >>    if(next_seq_sect == this_sector || disk == 0)
> >
> > typo error: s/disk/dist/g
> >
> >>    -> sequential reads
> >>     best_disk = disk;
> >>     if (...)
> >>      choose_next_idle = 1
> >>      continue;
> >>
> >>   for_each_rdev
> >>   -> iterate next rdev
> >>    if (pending == 0)
> >>     best_disk = disk;
> >>     -> choose the next idle disk
> >>     break;
> >>
> >>    if (choose_next_idle)
> >>     -> keep using this rdev if there are no other idle disk
> >>     continue
> >>
> >> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> remove the code:
> >>
> >> -               /* If device is idle, use it */
> >> -               if (pending == 0) {
> >> -                       best_disk = disk;
> >> -                       break;
> >> -               }
> >>
> >> Hence choose next idle will never work now, fix this problem by
> >> following:
> >>
> >> 1) don't set best_disk in this case, read_balance() will choose the best
> >>     disk after iterating all the disks;
> >> 2) add 'pending' so that other idle disk will be chosen;
> >> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>     are rotational, this disk will still be chosen;
> >>
> >> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >> ---
> >>   drivers/md/raid1.c | 21 ++++++++++++---------
> >>   1 file changed, 12 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c60ea58ae8c5..d0bc67e6d068 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          unsigned int min_pending;
> >>          struct md_rdev *rdev;
> >>          int choose_first;
> >> -       int choose_next_idle;
> >>
> >>          /*
> >>           * Check if we can balance. We can balance on the whole
> >> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>          best_pending_disk = -1;
> >>          min_pending = UINT_MAX;
> >>          best_good_sectors = 0;
> >> -       choose_next_idle = 0;
> >>          clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>
> >>          if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                          int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >>                          struct raid1_info *mirror = &conf->mirrors[disk];
> >>
> >> -                       best_disk = disk;
> >>                          /*
> >>                           * If buffered sequential IO size exceeds optimal
> >>                           * iosize, check if there is idle disk. If yes, choose
> >> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>                              mirror->next_seq_sect > opt_iosize &&
> >>                              mirror->next_seq_sect - opt_iosize >=
> >>                              mirror->seq_start) {
> >> -                               choose_next_idle = 1;
> >> -                               continue;
> >> +                               /*
> >> +                                * Add 'pending' to avoid choosing this disk if
> >> +                                * there is other idle disk.
> >> +                                * Set 'dist' to 0, so that if there is no other
> >> +                                * idle disk and all disks are rotational, this
> >> +                                * disk will still be chosen.
> >> +                                */
> >> +                               pending++;
> >> +                               dist = 0;
> >
> > There is a problem. If all disks are not idle and there is a disk with
> > dist=0 before the seq disk, it can't read from the seq disk. It will
> > read from the first disk with dist=0. Maybe we can only add the codes
> > which are removed from 2e52d449bcec?
>
> If there is a disk with disk=0, then best_dist_disk will be updated to
> the disk, and best_dist will be updated to 0 already:
>
> // in each iteration
> if (dist < best_dist) {
>         best_dist = dist;
>         btest_disk_disk = disk;
> }
>
> In this case, best_dist will be set to the first disk with dist=0, and
> at last, the disk will be chosen:
>
> if (best_disk == -1) {
>          if (has_nonrot_disk || min_pending == 0)
>                  best_disk = best_pending_disk;
>          else
>                  best_disk = best_dist_disk;
>                 -> the first disk with dist=0;
> }
>
> So, the problem that you concerned should not exist.

Hi Kuai

Thanks for the explanation. You're right. It chooses the first disk
which has dist==0. In the above, you made the same typo error disk=0
as the comment. I guess you want to use dist=0, right? Beside this,
this patch is good to me.

Best Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Best Regards
> > Xiao
> >
> >> +                       } else {
> >> +                               best_disk = disk;
> >> +                               break;
> >>                          }
> >> -                       break;
> >>                  }
> >>
> >> -               if (choose_next_idle)
> >> -                       continue;
> >> -
> >>                  if (min_pending > pending) {
> >>                          min_pending = pending;
> >>                          best_pending_disk = disk;
> >> --
> >> 2.39.2
> >>
> >>
> >
> > .
> >
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Yu Kuai 1 year, 11 months ago

Hi,

在 2024/02/26 17:24, Xiao Ni 写道:
> On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2024/02/26 16:55, Xiao Ni 写道:
>>> Hi Kuai
>>>
>>> Thanks for the effort!
>>>
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
>>>> the case choose next idle in read_balance():
>>>>
>>>> read_balance:
>>>>    for_each_rdev
>>>>     if(next_seq_sect == this_sector || disk == 0)
>>>
>>> typo error: s/disk/dist/g
>>>
>>>>     -> sequential reads
>>>>      best_disk = disk;
>>>>      if (...)
>>>>       choose_next_idle = 1
>>>>       continue;
>>>>
>>>>    for_each_rdev
>>>>    -> iterate next rdev
>>>>     if (pending == 0)
>>>>      best_disk = disk;
>>>>      -> choose the next idle disk
>>>>      break;
>>>>
>>>>     if (choose_next_idle)
>>>>      -> keep using this rdev if there are no other idle disk
>>>>      continue
>>>>
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> remove the code:
>>>>
>>>> -               /* If device is idle, use it */
>>>> -               if (pending == 0) {
>>>> -                       best_disk = disk;
>>>> -                       break;
>>>> -               }
>>>>
>>>> Hence choose next idle will never work now, fix this problem by
>>>> following:
>>>>
>>>> 1) don't set best_disk in this case, read_balance() will choose the best
>>>>      disk after iterating all the disks;
>>>> 2) add 'pending' so that other idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>>>      are rotational, this disk will still be chosen;
>>>>
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>>    drivers/md/raid1.c | 21 ++++++++++++---------
>>>>    1 file changed, 12 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>           unsigned int min_pending;
>>>>           struct md_rdev *rdev;
>>>>           int choose_first;
>>>> -       int choose_next_idle;
>>>>
>>>>           /*
>>>>            * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>           best_pending_disk = -1;
>>>>           min_pending = UINT_MAX;
>>>>           best_good_sectors = 0;
>>>> -       choose_next_idle = 0;
>>>>           clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>>
>>>>           if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                           int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>>                           struct raid1_info *mirror = &conf->mirrors[disk];
>>>>
>>>> -                       best_disk = disk;
>>>>                           /*
>>>>                            * If buffered sequential IO size exceeds optimal
>>>>                            * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                               mirror->next_seq_sect > opt_iosize &&
>>>>                               mirror->next_seq_sect - opt_iosize >=
>>>>                               mirror->seq_start) {
>>>> -                               choose_next_idle = 1;
>>>> -                               continue;
>>>> +                               /*
>>>> +                                * Add 'pending' to avoid choosing this disk if
>>>> +                                * there is other idle disk.
>>>> +                                * Set 'dist' to 0, so that if there is no other
>>>> +                                * idle disk and all disks are rotational, this
>>>> +                                * disk will still be chosen.
>>>> +                                */
>>>> +                               pending++;
>>>> +                               dist = 0;
>>>
>>> There is a problem. If all disks are not idle and there is a disk with
>>> dist=0 before the seq disk, it can't read from the seq disk. It will
>>> read from the first disk with dist=0. Maybe we can only add the codes
>>> which are removed from 2e52d449bcec?
>>
>> If there is a disk with disk=0, then best_dist_disk will be updated to
>> the disk, and best_dist will be updated to 0 already:
>>
>> // in each iteration
>> if (dist < best_dist) {
>>          best_dist = dist;
>>          btest_disk_disk = disk;
>> }
>>
>> In this case, best_dist will be set to the first disk with dist=0, and
>> at last, the disk will be chosen:
>>
>> if (best_disk == -1) {
>>           if (has_nonrot_disk || min_pending == 0)
>>                   best_disk = best_pending_disk;
>>           else
>>                   best_disk = best_dist_disk;
>>                  -> the first disk with dist=0;
>> }
>>
>> So, the problem that you concerned should not exist.
> 
> Hi Kuai
> 
> Thanks for the explanation. You're right. It chooses the first disk
> which has dist==0. In the above, you made the same typo error disk=0
> as the comment. I guess you want to use dist=0, right? Beside this,
> this patch is good to me.

Yes, and Paul change the name 'best_dist' to 'closest_dist_disk',
and 'btest_disk_disk' to 'closest_dist' in the last patch to avoid typo
like this. :)

Thanks,
Kuai


> 
> Best Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>>
>>> Best Regards
>>> Xiao
>>>
>>>> +                       } else {
>>>> +                               best_disk = disk;
>>>> +                               break;
>>>>                           }
>>>> -                       break;
>>>>                   }
>>>>
>>>> -               if (choose_next_idle)
>>>> -                       continue;
>>>> -
>>>>                   if (min_pending > pending) {
>>>>                           min_pending = pending;
>>>>                           best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>>
>>>>
>>>
>>> .
>>>
>>
> 
> .
>

Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Posted by Xiao Ni 1 year, 11 months ago

On Mon, Feb 26, 2024 at 5:40 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/02/26 17:24, Xiao Ni 写道:
> > On Mon, Feb 26, 2024 at 5:12 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> 在 2024/02/26 16:55, Xiao Ni 写道:
> >>> Hi Kuai
> >>>
> >>> Thanks for the effort!
> >>>
> >>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>>>
> >>>> From: Yu Kuai <yukuai3@huawei.com>
> >>>>
> >>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add
> >>>> the case choose next idle in read_balance():
> >>>>
> >>>> read_balance:
> >>>>    for_each_rdev
> >>>>     if(next_seq_sect == this_sector || disk == 0)
> >>>
> >>> typo error: s/disk/dist/g
> >>>
> >>>>     -> sequential reads
> >>>>      best_disk = disk;
> >>>>      if (...)
> >>>>       choose_next_idle = 1
> >>>>       continue;
> >>>>
> >>>>    for_each_rdev
> >>>>    -> iterate next rdev
> >>>>     if (pending == 0)
> >>>>      best_disk = disk;
> >>>>      -> choose the next idle disk
> >>>>      break;
> >>>>
> >>>>     if (choose_next_idle)
> >>>>      -> keep using this rdev if there are no other idle disk
> >>>>      continue
> >>>>
> >>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >>>> remove the code:
> >>>>
> >>>> -               /* If device is idle, use it */
> >>>> -               if (pending == 0) {
> >>>> -                       best_disk = disk;
> >>>> -                       break;
> >>>> -               }
> >>>>
> >>>> Hence choose next idle will never work now, fix this problem by
> >>>> following:
> >>>>
> >>>> 1) don't set best_disk in this case, read_balance() will choose the best
> >>>>      disk after iterating all the disks;
> >>>> 2) add 'pending' so that other idle disk will be chosen;
> >>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
> >>>>      are rotational, this disk will still be chosen;
> >>>>
> >>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> >>>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
> >>>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
> >>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >>>> ---
> >>>>    drivers/md/raid1.c | 21 ++++++++++++---------
> >>>>    1 file changed, 12 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >>>> index c60ea58ae8c5..d0bc67e6d068 100644
> >>>> --- a/drivers/md/raid1.c
> >>>> +++ b/drivers/md/raid1.c
> >>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>           unsigned int min_pending;
> >>>>           struct md_rdev *rdev;
> >>>>           int choose_first;
> >>>> -       int choose_next_idle;
> >>>>
> >>>>           /*
> >>>>            * Check if we can balance. We can balance on the whole
> >>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>           best_pending_disk = -1;
> >>>>           min_pending = UINT_MAX;
> >>>>           best_good_sectors = 0;
> >>>> -       choose_next_idle = 0;
> >>>>           clear_bit(R1BIO_FailFast, &r1_bio->state);
> >>>>
> >>>>           if ((conf->mddev->recovery_cp < this_sector + sectors) ||
> >>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>                           int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
> >>>>                           struct raid1_info *mirror = &conf->mirrors[disk];
> >>>>
> >>>> -                       best_disk = disk;
> >>>>                           /*
> >>>>                            * If buffered sequential IO size exceeds optimal
> >>>>                            * iosize, check if there is idle disk. If yes, choose
> >>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
> >>>>                               mirror->next_seq_sect > opt_iosize &&
> >>>>                               mirror->next_seq_sect - opt_iosize >=
> >>>>                               mirror->seq_start) {
> >>>> -                               choose_next_idle = 1;
> >>>> -                               continue;
> >>>> +                               /*
> >>>> +                                * Add 'pending' to avoid choosing this disk if
> >>>> +                                * there is other idle disk.
> >>>> +                                * Set 'dist' to 0, so that if there is no other
> >>>> +                                * idle disk and all disks are rotational, this
> >>>> +                                * disk will still be chosen.
> >>>> +                                */
> >>>> +                               pending++;
> >>>> +                               dist = 0;
> >>>
> >>> There is a problem. If all disks are not idle and there is a disk with
> >>> dist=0 before the seq disk, it can't read from the seq disk. It will
> >>> read from the first disk with dist=0. Maybe we can only add the codes
> >>> which are removed from 2e52d449bcec?
> >>
> >> If there is a disk with disk=0, then best_dist_disk will be updated to
> >> the disk, and best_dist will be updated to 0 already:
> >>
> >> // in each iteration
> >> if (dist < best_dist) {
> >>          best_dist = dist;
> >>          btest_disk_disk = disk;
> >> }
> >>
> >> In this case, best_dist will be set to the first disk with dist=0, and
> >> at last, the disk will be chosen:
> >>
> >> if (best_disk == -1) {
> >>           if (has_nonrot_disk || min_pending == 0)
> >>                   best_disk = best_pending_disk;
> >>           else
> >>                   best_disk = best_dist_disk;
> >>                  -> the first disk with dist=0;
> >> }
> >>
> >> So, the problem that you concerned should not exist.
> >
> > Hi Kuai
> >
> > Thanks for the explanation. You're right. It chooses the first disk
> > which has dist==0. In the above, you made the same typo error disk=0
> > as the comment. I guess you want to use dist=0, right? Beside this,
> > this patch is good to me.
>
> Yes, and Paul change the name 'best_dist' to 'closest_dist_disk',
> and 'btest_disk_disk' to 'closest_dist' in the last patch to avoid typo
> like this. :)

Ah, thanks :)  I haven't been there.

Regards
Xiao
>
> Thanks,
> Kuai
>
>
> >
> > Best Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>> Best Regards
> >>> Xiao
> >>>
> >>>> +                       } else {
> >>>> +                               best_disk = disk;
> >>>> +                               break;
> >>>>                           }
> >>>> -                       break;
> >>>>                   }
> >>>>
> >>>> -               if (choose_next_idle)
> >>>> -                       continue;
> >>>> -
> >>>>                   if (min_pending > pending) {
> >>>>                           min_pending = pending;
> >>>>                           best_pending_disk = disk;
> >>>> --
> >>>> 2.39.2
> >>>>
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>