[PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits

Posted by John Garry 3 months ago
The atomic write unit max value is limited by any stacked device stripe
size.

It is required that the atomic write unit is a power-of-2 factor of the
stripe size.

Currently we use the io_min limit to hold the stripe size, and check for an
io_min <= SECTOR_SIZE when deciding whether we have a striped stacked device.

Nilay reports that this causes a problem when the physical block size is
greater than SECTOR_SIZE [0].

Furthermore, io_min may be mutated when stacking devices, which makes it a
poor candidate to hold the stripe size. One example of when io_min may
change is when it is less than the physical block size.

Use chunk_sectors to hold the stripe size, which is more appropriate.

[0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-settings.c | 51 +++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 22 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 7ca21fb32598..20d3563f5d3f 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -596,41 +596,47 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
 	return true;
 }
 
+static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
+{
+	return 1 << (ffs(nr) - 1);
+}
 
-/* Check stacking of first bottom device */
-static bool blk_stack_atomic_writes_head(struct queue_limits *t,
-				struct queue_limits *b)
+static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
 {
-	if (b->atomic_write_hw_boundary &&
-	    !blk_stack_atomic_writes_boundary_head(t, b))
-		return false;
+	unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
 
-	if (t->io_min <= SECTOR_SIZE) {
-		/* No chunk sectors, so use bottom device values directly */
-		t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
-		t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
-		t->atomic_write_hw_max = b->atomic_write_hw_max;
-		return true;
-	}
+	if (!t->chunk_sectors)
+		return;
 
 	/*
 	 * Find values for limits which work for chunk size.
 	 * b->atomic_write_hw_unit_{min, max} may not be aligned with chunk
-	 * size (t->io_min), as chunk size is not restricted to a power-of-2.
+	 * size, as the chunk size is not restricted to a power-of-2.
 	 * So we need to find highest power-of-2 which works for the chunk
 	 * size.
-	 * As an example scenario, we could have b->unit_max = 16K and
-	 * t->io_min = 24K. For this case, reduce t->unit_max to a value
-	 * aligned with both limits, i.e. 8K in this example.
+	 * As an example scenario, we could have t->unit_max = 16K and
+	 * t->chunk_sectors = 24KB. For this case, reduce t->unit_max to a
+	 * value aligned with both limits, i.e. 8K in this example.
 	 */
-	t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
-	while (t->io_min % t->atomic_write_hw_unit_max)
-		t->atomic_write_hw_unit_max /= 2;
+	t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max,
+					max_pow_of_two_factor(chunk_bytes));
 
-	t->atomic_write_hw_unit_min = min(b->atomic_write_hw_unit_min,
+	t->atomic_write_hw_unit_min = min(t->atomic_write_hw_unit_min,
 					  t->atomic_write_hw_unit_max);
-	t->atomic_write_hw_max = min(b->atomic_write_hw_max, t->io_min);
+	t->atomic_write_hw_max = min(t->atomic_write_hw_max, chunk_bytes);
+}
 
+/* Check stacking of first bottom device */
+static bool blk_stack_atomic_writes_head(struct queue_limits *t,
+				struct queue_limits *b)
+{
+	if (b->atomic_write_hw_boundary &&
+	    !blk_stack_atomic_writes_boundary_head(t, b))
+		return false;
+
+	t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
+	t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
+	t->atomic_write_hw_max = b->atomic_write_hw_max;
 	return true;
 }
 
@@ -658,6 +664,7 @@ static void blk_stack_atomic_writes_limits(struct queue_limits *t,
 
 	if (!blk_stack_atomic_writes_head(t, b))
 		goto unsupported;
+	blk_stack_atomic_writes_chunk_sectors(t);
 	return;
 
 unsupported:
-- 
2.43.5
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by Mikulas Patocka 3 months ago

On Thu, 3 Jul 2025, John Garry wrote:

> The atomic write unit max value is limited by any stacked device stripe
> size.
> 
> It is required that the atomic write unit is a power-of-2 factor of the
> stripe size.
> 
> Currently we use io_min limit to hold the stripe size, and check for a
> io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.
> 
> Nilay reports that this causes a problem when the physical block size is
> greater than SECTOR_SIZE [0].
> 
> Furthermore, io_min may be mutated when stacking devices, and this makes
> it a poor candidate to hold the stripe size. Such an example (of when
> io_min may change) would be when the io_min is less than the physical
> block size.
> 
> Use chunk_sectors to hold the stripe size, which is more appropriate.
> 
> [0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  block/blk-settings.c | 51 +++++++++++++++++++++++++-------------------
>  1 file changed, 29 insertions(+), 22 deletions(-)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 7ca21fb32598..20d3563f5d3f 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -596,41 +596,47 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
>  	return true;
>  }
>  
> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
> +{
> +	return 1 << (ffs(nr) - 1);

This could be changed to "nr & -nr".

> +}
>  
> -/* Check stacking of first bottom device */
> -static bool blk_stack_atomic_writes_head(struct queue_limits *t,
> -				struct queue_limits *b)
> +static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
>  {
> -	if (b->atomic_write_hw_boundary &&
> -	    !blk_stack_atomic_writes_boundary_head(t, b))
> -		return false;
> +	unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;

What about integer overflow?

> -	if (t->io_min <= SECTOR_SIZE) {
> -		/* No chunk sectors, so use bottom device values directly */
> -		t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
> -		t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
> -		t->atomic_write_hw_max = b->atomic_write_hw_max;
> -		return true;
> -	}
> +	if (!t->chunk_sectors)
> +		return;
>  
>  	/*
>  	 * Find values for limits which work for chunk size.
>  	 * b->atomic_write_hw_unit_{min, max} may not be aligned with chunk
> -	 * size (t->io_min), as chunk size is not restricted to a power-of-2.
> +	 * size, as the chunk size is not restricted to a power-of-2.
>  	 * So we need to find highest power-of-2 which works for the chunk
>  	 * size.
> -	 * As an example scenario, we could have b->unit_max = 16K and
> -	 * t->io_min = 24K. For this case, reduce t->unit_max to a value
> -	 * aligned with both limits, i.e. 8K in this example.
> +	 * As an example scenario, we could have t->unit_max = 16K and
> +	 * t->chunk_sectors = 24KB. For this case, reduce t->unit_max to a
> +	 * value aligned with both limits, i.e. 8K in this example.
>  	 */
> -	t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
> -	while (t->io_min % t->atomic_write_hw_unit_max)
> -		t->atomic_write_hw_unit_max /= 2;
> +	t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max,
> +					max_pow_of_two_factor(chunk_bytes));
>  
> -	t->atomic_write_hw_unit_min = min(b->atomic_write_hw_unit_min,
> +	t->atomic_write_hw_unit_min = min(t->atomic_write_hw_unit_min,
>  					  t->atomic_write_hw_unit_max);
> -	t->atomic_write_hw_max = min(b->atomic_write_hw_max, t->io_min);
> +	t->atomic_write_hw_max = min(t->atomic_write_hw_max, chunk_bytes);
> +}
>  
> +/* Check stacking of first bottom device */
> +static bool blk_stack_atomic_writes_head(struct queue_limits *t,
> +				struct queue_limits *b)
> +{
> +	if (b->atomic_write_hw_boundary &&
> +	    !blk_stack_atomic_writes_boundary_head(t, b))
> +		return false;
> +
> +	t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
> +	t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
> +	t->atomic_write_hw_max = b->atomic_write_hw_max;
>  	return true;
>  }
>  
> @@ -658,6 +664,7 @@ static void blk_stack_atomic_writes_limits(struct queue_limits *t,
>  
>  	if (!blk_stack_atomic_writes_head(t, b))
>  		goto unsupported;
> +	blk_stack_atomic_writes_chunk_sectors(t);
>  	return;
>  
>  unsupported:
> -- 
> 2.43.5
>
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by John Garry 3 months ago
On 03/07/2025 14:31, Mikulas Patocka wrote:
> 
> 
> On Thu, 3 Jul 2025, John Garry wrote:
> 
>> The atomic write unit max value is limited by any stacked device stripe
>> size.
>>
>> It is required that the atomic write unit is a power-of-2 factor of the
>> stripe size.
>>
>> Currently we use io_min limit to hold the stripe size, and check for a
>> io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.
>>
>> Nilay reports that this causes a problem when the physical block size is
>> greater than SECTOR_SIZE [0].
>>
>> Furthermore, io_min may be mutated when stacking devices, and this makes
>> it a poor candidate to hold the stripe size. Such an example (of when
>> io_min may change) would be when the io_min is less than the physical
>> block size.
>>
>> Use chunk_sectors to hold the stripe size, which is more appropriate.
>>
>> [0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781
>>
>> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   block/blk-settings.c | 51 +++++++++++++++++++++++++-------------------
>>   1 file changed, 29 insertions(+), 22 deletions(-)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 7ca21fb32598..20d3563f5d3f 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -596,41 +596,47 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
>>   	return true;
>>   }
>>   
>> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
>> +{
>> +	return 1 << (ffs(nr) - 1);
> 
> This could be changed to "nr & -nr".

Sure, but I doubt that it is a more natural form.
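
For what it's worth, a quick userspace check (illustration only, not part
of the patch) that the two forms agree for any nonzero value:

#include <assert.h>
#include <strings.h>	/* ffs() */

int main(void)
{
	/* lowest set bit == largest power-of-2 factor for nonzero nr */
	unsigned int samples[] = { 1, 2, 3, 512, 24 * 1024, 0x40000000 };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		unsigned int nr = samples[i];

		assert((1u << (ffs(nr) - 1)) == (nr & -nr));
	}
	return 0;
}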

> 
>> +}
>>   
>> -/* Check stacking of first bottom device */
>> -static bool blk_stack_atomic_writes_head(struct queue_limits *t,
>> -				struct queue_limits *b)
>> +static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
>>   {
>> -	if (b->atomic_write_hw_boundary &&
>> -	    !blk_stack_atomic_writes_boundary_head(t, b))
>> -		return false;
>> +	unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
> 
> What about integer overflow?

I suppose theoretically it could happen, and I'm happy to change it.

However, there seems to be precedent for assuming it won't:

- in stripe_io_hints(), we hold chunk_size in an unsigned int
- in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in 
lim.io_min, which is an unsigned int type.

Please let me know your thoughts on also changing these sorts of 
instances. Is it realistic to expect chunk_bytes > UINT_MAX?
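
If we do want to guard against it, a minimal sketch (untested; assuming
check_shl_overflow() from linux/overflow.h is suitable here) would be to
skip the chunk-based clamping rather than let the shift wrap:

	unsigned int chunk_bytes;

	/* bail out instead of clamping against a wrapped value */
	if (!t->chunk_sectors ||
	    check_shl_overflow(t->chunk_sectors, SECTOR_SHIFT, &chunk_bytes))
		return;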

Thanks,
John
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by Mikulas Patocka 3 months ago

On Thu, 3 Jul 2025, John Garry wrote:

> > >   -/* Check stacking of first bottom device */
> > > -static bool blk_stack_atomic_writes_head(struct queue_limits *t,
> > > -				struct queue_limits *b)
> > > +static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
> > >   {
> > > -	if (b->atomic_write_hw_boundary &&
> > > -	    !blk_stack_atomic_writes_boundary_head(t, b))
> > > -		return false;
> > > +	unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
> > 
> > What about integer overflow?
> 
> I suppose theoretically it could happen, and I'm happy to change.
> 
> However there seems to be precedent in assuming it won't:
> 
> - in stripe_op_hints(), we hold chunk_size in an unsigned int
> - in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in lim.io_min,
> which is an unsigned int type.
> 
> Please let me know your thoughts on also changing these sort of instances. Is
> it realistic to expect chunk_bytes > UINT_MAX?
> 
> Thanks,
> John

dm-stripe can be created with a stripe size that is more than 0xffffffff 
bytes.

Though, the integer overflow already exists in the existing dm-stripe 
target:
static void stripe_io_hints(struct dm_target *ti,
                            struct queue_limits *limits)
{
        struct stripe_c *sc = ti->private;
        unsigned int chunk_size = sc->chunk_size << SECTOR_SHIFT;

        limits->io_min = chunk_size;
        limits->io_opt = chunk_size * sc->stripes;
}

What should we set there as io_min and io_opt if sc->chunk_size << 
SECTOR_SHIFT overflows? Should we set nothing?

Mikulas
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by John Garry 3 months ago
On 03/07/2025 16:36, Mikulas Patocka wrote:
>> I suppose theoretically it could happen, and I'm happy to change.
>>
>> However there seems to be precedent in assuming it won't:
>>
>> - in stripe_op_hints(), we hold chunk_size in an unsigned int
>> - in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in lim.io_min,
>> which is an unsigned int type.
>>
>> Please let me know your thoughts on also changing these sort of instances. Is
>> it realistic to expect chunk_bytes > UINT_MAX?
>>
>> Thanks,
>> John
> dm-stripe can be created with a stripe size that is more than 0xffffffff
> bytes.
> 
> Though, the integer overflow already exists in the existing dm-stripe
> target:
> static void stripe_io_hints(struct dm_target *ti,
>                              struct queue_limits *limits)
> {
>          struct stripe_c *sc = ti->private;
>          unsigned int chunk_size = sc->chunk_size << SECTOR_SHIFT;
> 
>          limits->io_min = chunk_size;
>          limits->io_opt = chunk_size * sc->stripes;
> }
> What should we set there as io_min and io_opt if sc->chunk_size <<
> SECTOR_SHIFT overflows?


> Should we set nothing?

For io_min/io_opt, maybe reduce to a factor of the stripe size / width (one 
which fits in an unsigned int).
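
Something along these lines, rough and untested (the helper name is just
for illustration):

/*
 * Pick a factor of the stripe chunk size which fits in an unsigned int;
 * when the chunk itself is too big, fall back to a power-of-2 factor.
 */
static unsigned int stripe_chunk_bytes_capped(sector_t chunk_sectors)
{
	u64 chunk_bytes = (u64)chunk_sectors << SECTOR_SHIFT;

	if (chunk_bytes <= UINT_MAX)
		return chunk_bytes;

	return min_t(u64, chunk_bytes & -chunk_bytes, 1U << 31);
}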

I am not sure that it is even sane to have such huge values in io_min; 
perhaps the bottom disk io_min should be used directly instead.

Martin Petersen might have a better idea.. he added those sysfs files :)

Thanks,
John
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by Martin K. Petersen 3 months ago
John,

> For io_min/opt, maybe reduce to a factor of the stripe size / width
> (and which fits in a unsigned int).
>
> I am not sure if it is even sane to have such huge values in io_min
> and the bottom disk io_min should be used directly instead.

The intent for io_min was to convey the physical_block_size in the case
of an individual drive. And for it to be set to the stripe chunk size in
stacking scenarios that would otherwise involve read-modify-write (i.e.
RAID5 and RAID6).

io_opt was meant to communicate the stripe width. Reporting very large
values for io_opt is generally counterproductive since we can't write
multiple gigabytes in a single operation anyway.

logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors

-- 
Martin K. Petersen
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by John Garry 2 months, 2 weeks ago
On 09/07/2025 02:39, Martin K. Petersen wrote:
> The intent for io_min was to convey the physical_block_size in the case
> of an individual drive. And for it to be set to the stripe chunk size in
> stacking scenarios that would otherwise involve read-modify-write (i.e.
> RAID5 and RAID6).
> 
> io_opt was meant to communicate the stripe width. Reporting very large
> values for io_opt is generally counterproductive since we can't write
> multiple gigabytes in a single operation anyway.
> 
> logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors

Does pbs need to be a power-of-2?

The block queue limits and splitting code seems to rely on that, but it 
is not policed AFAICS.

The stacking code seems to just want it to be a multiple of lbs, which 
itself must be a power-of-2.
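
If we did want to police it, a minimal sketch (untested, and the exact
placement in the limits validation is a guess on my part) might be:

	if (!is_power_of_2(lim->physical_block_size))
		return -EINVAL;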

Thanks,
John
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by Martin K. Petersen 2 months, 2 weeks ago
John,

> Does pbs need to be a power-of-2?

For ATA and SCSI, definitely. NVMe could technically report a multiple
of the logical block size but I've never come across anything that
didn't report a power of two.

-- 
Martin K. Petersen
Re: [PATCH v3 5/5] block: use chunk_sectors when evaluating stacked atomic write limits
Posted by John Garry 3 months ago
On 09/07/2025 02:39, Martin K. Petersen wrote:
> 
> John,
> 
>> For io_min/opt, maybe reduce to a factor of the stripe size / width
>> (and which fits in a unsigned int).
>>
>> I am not sure if it is even sane to have such huge values in io_min
>> and the bottom disk io_min should be used directly instead.
> 
> The intent for io_min was to convey the physical_block_size in the case
> of an individual drive. And for it to be set to the stripe chunk size in
> stacking scenarios that would otherwise involve read-modify-write (i.e.
> RAID5 and RAID6).

And the same is done for md raid0/dm-stripe; I suppose that the idea is 
to encourage larger-than-chunk-size writes, so that the performance 
advantage of striping is seen.

> io_opt was meant to communicate the stripe width. Reporting very large
> values for io_opt is generally counterproductive since we can't write
> multiple gigabytes in a single operation anyway.
>

Right, and so it seems counterproductive to have a chunk size much bigger 
than the bottom device io_opt.

> logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors
>