Linux 6.9+ is unable to start a degraded RAID1 array with one drive
when that drive has the write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:
BUG_ON(sectors <= 0);
Call Trace:
? bio_split+0x96/0xb0
? exc_invalid_op+0x53/0x70
? bio_split+0x96/0xb0
? asm_exc_invalid_op+0x1b/0x20
? bio_split+0x96/0xb0
? raid1_read_request+0x890/0xd20
? __call_rcu_common.constprop.0+0x97/0x260
raid1_make_request+0x81/0xce0
? __get_random_u32_below+0x17/0x70
? new_slab+0x2b3/0x580
md_handle_request+0x77/0x210
md_submit_bio+0x62/0xa0
__submit_bio+0x17b/0x230
submit_bio_noacct_nocheck+0x18e/0x3c0
submit_bio_noacct+0x244/0x670
After investigation, it turned out that choose_slow_rdev() does not set
the value of max_sectors in some cases, and because of that,
raid1_read_request() calls bio_split() with sectors == 0.
Fix it by filling in this variable.
This bug was introduced in
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but was apparently hidden until
commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
landed shortly thereafter.
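
To make the failure mode concrete, here is a minimal userspace analogue
(made-up names, not the actual raid1 code): an out parameter is filled
in on only some paths, and the caller then hands the stale value to a
function that asserts on it, just as bio_split() does with
BUG_ON(sectors <= 0):

    #include <assert.h>
    #include <stdio.h>

    /* Stand-in for raid1_check_read_range(): pretend the whole
     * requested range is readable, as on a healthy member. */
    static int check_read_range(int sectors)
    {
        return sectors;
    }

    /* Stand-in for choose_slow_rdev(): the "whole range readable"
     * path returns without ever storing through max_sectors. */
    static int choose_disk(int want_sectors, int *max_sectors)
    {
        int read_len = check_read_range(want_sectors);

        if (read_len == want_sectors)
            return 0;           /* bug: forgot *max_sectors = read_len */

        *max_sectors = read_len;
        return 0;
    }

    /* Stand-in for bio_split(), which begins with BUG_ON(sectors <= 0). */
    static void split_bio(int sectors)
    {
        assert(sectors > 0);
        printf("split at %d sectors\n", sectors);
    }

    int main(void)
    {
        int max_sectors = 0;    /* caller leaves it for the callee */

        choose_disk(8, &max_sectors);
        split_bio(max_sectors); /* the assertion fires, like the BUG_ON */
        return 0;
    }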
Cc: stable@vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paul Luse <paul.e.luse@linux.intel.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/
--
Tested on both Linux 6.10 and 6.9.8.
Inside a VM, the mdadm testsuite for RAID1 on 6.10 did not find any problems:
./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
being compiled in).
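
The crash itself can also be reproduced directly with plain mdadm
commands along these lines (a sketch; device names are illustrative and
assume two free loop devices):

    # Create a two-member RAID1 with one write-mostly member and let
    # the initial resync finish.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/loop0 --write-mostly /dev/loop1
    mdadm --wait /dev/md0
    mdadm --stop /dev/md0

    # Re-assemble with only the write-mostly member; --run forces the
    # degraded start. On affected kernels the first read then hits
    # BUG_ON(sectors <= 0) in bio_split().
    mdadm --assemble --run /dev/md0 /dev/loop1
    dd if=/dev/md0 of=/dev/null bs=64k count=1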
Notes:
- I was reliably getting deadlocks when adding / removing devices on
  such an array while it was loaded with fsstress running 20 concurrent
  processes. When the array was idle, or loaded with fsstress running 8
  processes, no such deadlocks happened in my tests. This also occurred
  on unpatched Linux 6.8.0, but not on 6.1.97-rc1, so it is likely an
  independent regression (to be investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
  array under similar conditions, though this also happened on Linux
  6.1.97-rc1. With 8 concurrent fsstress processes it occurred only
  once across many tests.
- In my testing, hot-adding an internal bitmap to the array once failed
  with:
      mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
      mdadm: failed to set internal bitmap.
  even though no such resync or reshape was in progress according to
  /proc/mdstat. This seems unrelated, though.
---
drivers/md/raid1.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
len = r1_bio->sectors;
read_len = raid1_check_read_range(rdev, this_sector, &len);
if (read_len == r1_bio->sectors) {
+ *max_sectors = read_len;
update_read_sectors(conf, disk, this_sector, read_len);
return disk;
}
base-commit: 256abd8e550ce977b728be79a74e1729438b4948
--
2.25.1
Hi,
On 2024/07/12 4:23, Mateusz Jończyk wrote:
> Linux 6.9+ is unable to start a degraded RAID1 array with one drive
> when that drive has the write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
>
> [snip]
Thanks for the patch!
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
BTW, do you have plans to add a new test to mdadm tests? I'll
pick it up if you don't, just let me know.
Thanks,
Kuai
On 12.07.2024 at 03:16, Yu Kuai wrote:
> Hi,
>
> On 2024/07/12 4:23, Mateusz Jończyk wrote:
>> Linux 6.9+ is unable to start a degraded RAID1 array with one drive
>> when that drive has the write-mostly flag set. During such an attempt,
>> the following assertion in bio_split() is hit:
>> [snip]
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>
>
> BTW, do you have plans to add a new test to mdadm tests? I'll
> pick it up if you don't, just let me know.
>
> Thanks,
> Kuai

Yes, I'm working on it.
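A rough sketch of the direction (illustrative only; "check wait" and
the $md0/$dev0/$dev1 variables are assumed from the testsuite's usual
conventions, and the final test may look different):

    # Regression test sketch: a degraded RAID1 whose only member is
    # write-mostly must assemble, start and stay readable.
    mdadm -CR $md0 -l1 -n2 $dev0 --write-mostly $dev1
    check wait
    mdadm -S $md0

    # Start degraded with only the write-mostly member and read from
    # it; on affected kernels this crashes in bio_split().
    mdadm -A -R $md0 $dev1
    dd if=$md0 of=/dev/null bs=64k count=1 || exit 1
    mdadm -S $md0

Greetings,

Mateusz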
On Fri, Jul 12, 2024 at 9:17 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
[...]
> > After investigation, it turned out that choose_slow_rdev() does not
> > set the value of max_sectors in some cases, and because of that,
> > raid1_read_request() calls bio_split() with sectors == 0.
> >
> > [snip]
>
> Thanks for the patch!
>
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Applied to md-6.11. Thanks!
Song
On Thu, 11 Jul 2024 22:23:16 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> Linux 6.9+ is unable to start a degraded RAID1 array with one drive
> when that drive has the write-mostly flag set. During such an attempt,
> the following assertion in bio_split() is hit:
>
Nice catch and good patch :) Kuai?
-Paul