The atomic write unit max value is limited by any stacked device stripe
size.
It is required that the atomic write unit is a power-of-2 factor of the
stripe size.
Currently we use the io_min limit to hold the stripe size, and check for
an io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.
Nilay reports that this causes a problem when the physical block size is
greater than SECTOR_SIZE [0].
Furthermore, io_min may be mutated when stacking devices, and this makes
it a poor candidate to hold the stripe size. Such an example (of when
io_min may change) would be when the io_min is less than the physical
block size.
Use chunk_sectors to hold the stripe size, which is more appropriate.
[0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
block/blk-settings.c | 51 +++++++++++++++++++++++++-------------------
1 file changed, 29 insertions(+), 22 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 7ca21fb32598..20d3563f5d3f 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -596,41 +596,47 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
return true;
}
+static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
+{
+ return 1 << (ffs(nr) - 1);
+}
-/* Check stacking of first bottom device */
-static bool blk_stack_atomic_writes_head(struct queue_limits *t,
- struct queue_limits *b)
+static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
{
- if (b->atomic_write_hw_boundary &&
- !blk_stack_atomic_writes_boundary_head(t, b))
- return false;
+ unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
- if (t->io_min <= SECTOR_SIZE) {
- /* No chunk sectors, so use bottom device values directly */
- t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
- t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
- t->atomic_write_hw_max = b->atomic_write_hw_max;
- return true;
- }
+ if (!t->chunk_sectors)
+ return;
/*
* Find values for limits which work for chunk size.
* b->atomic_write_hw_unit_{min, max} may not be aligned with chunk
- * size (t->io_min), as chunk size is not restricted to a power-of-2.
+ * size, as the chunk size is not restricted to a power-of-2.
* So we need to find highest power-of-2 which works for the chunk
* size.
- * As an example scenario, we could have b->unit_max = 16K and
- * t->io_min = 24K. For this case, reduce t->unit_max to a value
- * aligned with both limits, i.e. 8K in this example.
+ * As an example scenario, we could have t->unit_max = 16K and
+ * t->chunk_sectors = 24KB. For this case, reduce t->unit_max to a
+ * value aligned with both limits, i.e. 8K in this example.
*/
- t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
- while (t->io_min % t->atomic_write_hw_unit_max)
- t->atomic_write_hw_unit_max /= 2;
+ t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max,
+ max_pow_of_two_factor(chunk_bytes));
- t->atomic_write_hw_unit_min = min(b->atomic_write_hw_unit_min,
+ t->atomic_write_hw_unit_min = min(t->atomic_write_hw_unit_min,
t->atomic_write_hw_unit_max);
- t->atomic_write_hw_max = min(b->atomic_write_hw_max, t->io_min);
+ t->atomic_write_hw_max = min(t->atomic_write_hw_max, chunk_bytes);
+}
+/* Check stacking of first bottom device */
+static bool blk_stack_atomic_writes_head(struct queue_limits *t,
+ struct queue_limits *b)
+{
+ if (b->atomic_write_hw_boundary &&
+ !blk_stack_atomic_writes_boundary_head(t, b))
+ return false;
+
+ t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
+ t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
+ t->atomic_write_hw_max = b->atomic_write_hw_max;
return true;
}
@@ -658,6 +664,7 @@ static void blk_stack_atomic_writes_limits(struct queue_limits *t,
if (!blk_stack_atomic_writes_head(t, b))
goto unsupported;
+ blk_stack_atomic_writes_chunk_sectors(t);
return;
unsupported:
--
2.43.5
On Thu, 3 Jul 2025, John Garry wrote:
> The atomic write unit max value is limited by any stacked device stripe
> size.
>
> It is required that the atomic write unit is a power-of-2 factor of the
> stripe size.
>
> Currently we use io_min limit to hold the stripe size, and check for a
> io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.
>
> Nilay reports that this causes a problem when the physical block size is
> greater than SECTOR_SIZE [0].
>
> Furthermore, io_min may be mutated when stacking devices, and this makes
> it a poor candidate to hold the stripe size. Such an example (of when
> io_min may change) would be when the io_min is less than the physical
> block size.
>
> Use chunk_sectors to hold the stripe size, which is more appropriate.
>
> [0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781
>
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> block/blk-settings.c | 51 +++++++++++++++++++++++++-------------------
> 1 file changed, 29 insertions(+), 22 deletions(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 7ca21fb32598..20d3563f5d3f 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -596,41 +596,47 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
> return true;
> }
>
> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
> +{
> + return 1 << (ffs(nr) - 1);
This could be changed to "nr & -nr".
> +}
>
> -/* Check stacking of first bottom device */
> -static bool blk_stack_atomic_writes_head(struct queue_limits *t,
> - struct queue_limits *b)
> +static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
> {
> - if (b->atomic_write_hw_boundary &&
> - !blk_stack_atomic_writes_boundary_head(t, b))
> - return false;
> + unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
What about integer overflow?
> - if (t->io_min <= SECTOR_SIZE) {
> - /* No chunk sectors, so use bottom device values directly */
> - t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
> - t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
> - t->atomic_write_hw_max = b->atomic_write_hw_max;
> - return true;
> - }
> + if (!t->chunk_sectors)
> + return;
>
> /*
> * Find values for limits which work for chunk size.
> * b->atomic_write_hw_unit_{min, max} may not be aligned with chunk
> - * size (t->io_min), as chunk size is not restricted to a power-of-2.
> + * size, as the chunk size is not restricted to a power-of-2.
> * So we need to find highest power-of-2 which works for the chunk
> * size.
> - * As an example scenario, we could have b->unit_max = 16K and
> - * t->io_min = 24K. For this case, reduce t->unit_max to a value
> - * aligned with both limits, i.e. 8K in this example.
> + * As an example scenario, we could have t->unit_max = 16K and
> + * t->chunk_sectors = 24KB. For this case, reduce t->unit_max to a
> + * value aligned with both limits, i.e. 8K in this example.
> */
> - t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
> - while (t->io_min % t->atomic_write_hw_unit_max)
> - t->atomic_write_hw_unit_max /= 2;
> + t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max,
> + max_pow_of_two_factor(chunk_bytes));
>
> - t->atomic_write_hw_unit_min = min(b->atomic_write_hw_unit_min,
> + t->atomic_write_hw_unit_min = min(t->atomic_write_hw_unit_min,
> t->atomic_write_hw_unit_max);
> - t->atomic_write_hw_max = min(b->atomic_write_hw_max, t->io_min);
> + t->atomic_write_hw_max = min(t->atomic_write_hw_max, chunk_bytes);
> +}
>
> +/* Check stacking of first bottom device */
> +static bool blk_stack_atomic_writes_head(struct queue_limits *t,
> + struct queue_limits *b)
> +{
> + if (b->atomic_write_hw_boundary &&
> + !blk_stack_atomic_writes_boundary_head(t, b))
> + return false;
> +
> + t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
> + t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
> + t->atomic_write_hw_max = b->atomic_write_hw_max;
> return true;
> }
>
> @@ -658,6 +664,7 @@ static void blk_stack_atomic_writes_limits(struct queue_limits *t,
>
> if (!blk_stack_atomic_writes_head(t, b))
> goto unsupported;
> + blk_stack_atomic_writes_chunk_sectors(t);
> return;
>
> unsupported:
> --
> 2.43.5
>
On 03/07/2025 14:31, Mikulas Patocka wrote:
>
>
> On Thu, 3 Jul 2025, John Garry wrote:
>
>> The atomic write unit max value is limited by any stacked device stripe
>> size.
>>
>> It is required that the atomic write unit is a power-of-2 factor of the
>> stripe size.
>>
>> Currently we use io_min limit to hold the stripe size, and check for a
>> io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.
>>
>> Nilay reports that this causes a problem when the physical block size is
>> greater than SECTOR_SIZE [0].
>>
>> Furthermore, io_min may be mutated when stacking devices, and this makes
>> it a poor candidate to hold the stripe size. Such an example (of when
>> io_min may change) would be when the io_min is less than the physical
>> block size.
>>
>> Use chunk_sectors to hold the stripe size, which is more appropriate.
>>
>> [0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781
>>
>> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> block/blk-settings.c | 51 +++++++++++++++++++++++++-------------------
>> 1 file changed, 29 insertions(+), 22 deletions(-)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 7ca21fb32598..20d3563f5d3f 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -596,41 +596,47 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
>> return true;
>> }
>>
>> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
>> +{
>> + return 1 << (ffs(nr) - 1);
>
> This could be changed to "nr & -nr".
Sure, but I doubt if that is a more natural form.
>
>> +}
>>
>> -/* Check stacking of first bottom device */
>> -static bool blk_stack_atomic_writes_head(struct queue_limits *t,
>> - struct queue_limits *b)
>> +static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
>> {
>> - if (b->atomic_write_hw_boundary &&
>> - !blk_stack_atomic_writes_boundary_head(t, b))
>> - return false;
>> + unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
>
> What about integer overflow?
I suppose theoretically it could happen, and I'm happy to change.
However there seems to be precedent in assuming it won't:
- in stripe_op_hints(), we hold chunk_size in an unsigned int
- in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in
lim.io_min, which is an unsigned int type.
Please let me know your thoughts on also changing these sort of
instances. Is it realistic to expect chunk_bytes > UINT_MAX?
Thanks,
John
On Thu, 3 Jul 2025, John Garry wrote:
> > > -/* Check stacking of first bottom device */
> > > -static bool blk_stack_atomic_writes_head(struct queue_limits *t,
> > > - struct queue_limits *b)
> > > +static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
> > > {
> > > - if (b->atomic_write_hw_boundary &&
> > > - !blk_stack_atomic_writes_boundary_head(t, b))
> > > - return false;
> > > + unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
> >
> > What about integer overflow?
>
> I suppose theoretically it could happen, and I'm happy to change.
>
> However there seems to be precedent in assuming it won't:
>
> - in stripe_op_hints(), we hold chunk_size in an unsigned int
> - in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in lim.io_min,
> which is an unsigned int type.
>
> Please let me know your thoughts on also changing these sort of instances. Is
> it realistic to expect chunk_bytes > UINT_MAX?
>
> Thanks,
> John
dm-stripe can be created with a stripe size that is more than 0xffffffff
bytes.
Though, the integer overflow already exists in the existing dm-stripe
target:
static void stripe_io_hints(struct dm_target *ti,
		struct queue_limits *limits)
{
	struct stripe_c *sc = ti->private;
	unsigned int chunk_size = sc->chunk_size << SECTOR_SHIFT;

	limits->io_min = chunk_size;
	limits->io_opt = chunk_size * sc->stripes;
}
What should we set there as io_min and io_opt if sc->chunk_size <<
SECTOR_SHIFT overflows? Should we set nothing?
Mikulas
On 03/07/2025 16:36, Mikulas Patocka wrote:
>> I suppose theoretically it could happen, and I'm happy to change.
>>
>> However there seems to be precedent in assuming it won't:
>>
>> - in stripe_op_hints(), we hold chunk_size in an unsigned int
>> - in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in lim.io_min,
>> which is an unsigned int type.
>>
>> Please let me know your thoughts on also changing these sort of instances. Is
>> it realistic to expect chunk_bytes > UINT_MAX?
>>
>> Thanks,
>> John
> dm-stripe can be created with a stripe size that is more than 0xffffffff
> bytes.
>
> Though, the integer overflow already exists in the existing dm-stripe
> target:
> static void stripe_io_hints(struct dm_target *ti,
> struct queue_limits *limits)
> {
> struct stripe_c *sc = ti->private;
> unsigned int chunk_size = sc->chunk_size << SECTOR_SHIFT;
>
> limits->io_min = chunk_size;
> limits->io_opt = chunk_size * sc->stripes;
> }
> What should we set there as io_min and io_opt if sc->chunk_size <<
> SECTOR_SHIFT overflows?
> Should we set nothing?
For io_min/opt, maybe reduce to a factor of the stripe size / width (and
which fits in a unsigned int).
I am not sure if it is even sane to have such huge values in io_min and
the bottom disk io_min should be used directly instead.
Martin Petersen might have a better idea.. he added those sysfs files :)
Thanks,
John
John,

> For io_min/opt, maybe reduce to a factor of the stripe size / width
> (and which fits in a unsigned int).
>
> I am not sure if it is even sane to have such huge values in io_min
> and the bottom disk io_min should be used directly instead.

The intent for io_min was to convey the physical_block_size in the case
of an individual drive. And for it to be set to the stripe chunk size in
stacking scenarios that would otherwise involve read-modify-write (i.e.
RAID5 and RAID6).

io_opt was meant to communicate the stripe width. Reporting very large
values for io_opt is generally counterproductive since we can't write
multiple gigabytes in a single operation anyway.

logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors

--
Martin K. Petersen
On 09/07/2025 02:39, Martin K. Petersen wrote:
> The intent for io_min was to convey the physical_block_size in the case
> of an individual drive. And for it to be set to the stripe chunk size in
> stacking scenarios that would otherwise involve read-modify-write (i.e.
> RAID5 and RAID6).
>
> io_opt was meant to communicate the stripe width. Reporting very large
> values for io_opt is generally counterproductive since we can't write
> multiple gigabytes in a single operation anyway.
>
> logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors

Does pbs need to be a power-of-2? The block queue limits and splitting
code seems to rely on that, but it is not policed AFAICS.

The stacking code seems to just want it to be a multiple of lbs, which
itself must be a power-of-2.

Thanks,
John
John,

> Does pbs need to be a power-of-2?

For ATA and SCSI, definitely. NVMe could technically report a multiple
of the logical block size but I've never come across anything that
didn't report a power of two.

--
Martin K. Petersen
On 09/07/2025 02:39, Martin K. Petersen wrote:
>
> John,
>
>> For io_min/opt, maybe reduce to a factor of the stripe size / width
>> (and which fits in a unsigned int).
>>
>> I am not sure if it is even sane to have such huge values in io_min
>> and the bottom disk io_min should be used directly instead.
>
> The intent for io_min was to convey the physical_block_size in the case
> of an individual drive. And for it to be set to the stripe chunk size in
> stacking scenarios that would otherwise involve read-modify-write (i.e.
> RAID5 and RAID6).

And the same is done for md raid0/dm stripe; I suppose that the idea is
to encourage larger than chunk size writes to see the performance
advantage in striping.

>
> io_opt was meant to communicate the stripe width. Reporting very large
> values for io_opt is generally counterproductive since we can't write
> multiple gigabytes in a single operation anyway.
>

Right, and so it seems counterproductive to have chunk size much bigger
than bottom device io_opt.

> logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors
>