The atomic write unit max value is limited by any stacked device stripe
size.

It is required that the atomic write unit is a power-of-2 factor of the
stripe size.

Currently we use the io_min limit to hold the stripe size, and check for
io_min <= SECTOR_SIZE when deciding if we have a striped stacked device.

Nilay reports that this causes a problem when the physical block size is
greater than SECTOR_SIZE [0].

Furthermore, io_min may be mutated when stacking devices, which makes it
a poor candidate to hold the stripe size. One example of when io_min may
change is when it is less than the physical block size.

Use chunk_sectors to hold the stripe size, which is more appropriate.

[0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-settings.c | 51 +++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 22 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 7ca21fb32598..20d3563f5d3f 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -596,41 +596,47 @@ static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t,
return true;
}
+static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
+{
+ return 1 << (ffs(nr) - 1);
+}
-/* Check stacking of first bottom device */
-static bool blk_stack_atomic_writes_head(struct queue_limits *t,
- struct queue_limits *b)
+static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
{
- if (b->atomic_write_hw_boundary &&
- !blk_stack_atomic_writes_boundary_head(t, b))
- return false;
+ unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
- if (t->io_min <= SECTOR_SIZE) {
- /* No chunk sectors, so use bottom device values directly */
- t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
- t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
- t->atomic_write_hw_max = b->atomic_write_hw_max;
- return true;
- }
+ if (!t->chunk_sectors)
+ return;
/*
* Find values for limits which work for chunk size.
* b->atomic_write_hw_unit_{min, max} may not be aligned with chunk
- * size (t->io_min), as chunk size is not restricted to a power-of-2.
+ * size, as the chunk size is not restricted to a power-of-2.
* So we need to find highest power-of-2 which works for the chunk
* size.
- * As an example scenario, we could have b->unit_max = 16K and
- * t->io_min = 24K. For this case, reduce t->unit_max to a value
- * aligned with both limits, i.e. 8K in this example.
+ * As an example scenario, we could have t->unit_max = 16K and
+ * t->chunk_sectors = 24KB. For this case, reduce t->unit_max to a
+ * value aligned with both limits, i.e. 8K in this example.
*/
- t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
- while (t->io_min % t->atomic_write_hw_unit_max)
- t->atomic_write_hw_unit_max /= 2;
+ t->atomic_write_hw_unit_max = min(t->atomic_write_hw_unit_max,
+ max_pow_of_two_factor(chunk_bytes));
- t->atomic_write_hw_unit_min = min(b->atomic_write_hw_unit_min,
+ t->atomic_write_hw_unit_min = min(t->atomic_write_hw_unit_min,
t->atomic_write_hw_unit_max);
- t->atomic_write_hw_max = min(b->atomic_write_hw_max, t->io_min);
+ t->atomic_write_hw_max = min(t->atomic_write_hw_max, chunk_bytes);
+}
+/* Check stacking of first bottom device */
+static bool blk_stack_atomic_writes_head(struct queue_limits *t,
+ struct queue_limits *b)
+{
+ if (b->atomic_write_hw_boundary &&
+ !blk_stack_atomic_writes_boundary_head(t, b))
+ return false;
+
+ t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max;
+ t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min;
+ t->atomic_write_hw_max = b->atomic_write_hw_max;
return true;
}
@@ -658,6 +664,7 @@ static void blk_stack_atomic_writes_limits(struct queue_limits *t,
if (!blk_stack_atomic_writes_head(t, b))
goto unsupported;
+ blk_stack_atomic_writes_chunk_sectors(t);
return;
unsupported:
--
2.43.5
On Thu, 3 Jul 2025, John Garry wrote:

> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
> +{
> +	return 1 << (ffs(nr) - 1);

This could be changed to "nr & -nr".

> +static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t)
> {
> +	unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;

What about integer overflow?
On 03/07/2025 14:31, Mikulas Patocka wrote:
>> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
>> +{
>> +	return 1 << (ffs(nr) - 1);
>
> This could be changed to "nr & -nr".

Sure, but I doubt that it is a more natural form.

>> +	unsigned int chunk_bytes = t->chunk_sectors << SECTOR_SHIFT;
>
> What about integer overflow?

I suppose theoretically it could happen, and I'm happy to change it.

However, there seems to be precedent in assuming it won't:

- in stripe_io_hints(), we hold chunk_size in an unsigned int
- in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in lim.io_min,
  which is an unsigned int type.

Please let me know your thoughts on also changing these sorts of
instances. Is it realistic to expect chunk_bytes > UINT_MAX?

Thanks,
John
On Thu, 3 Jul 2025, John Garry wrote:

> I suppose theoretically it could happen, and I'm happy to change it.
>
> However, there seems to be precedent in assuming it won't:
>
> - in stripe_io_hints(), we hold chunk_size in an unsigned int
> - in raid0_set_limits(), we hold mddev->chunk_sectors << 9 in lim.io_min,
>   which is an unsigned int type.
>
> Please let me know your thoughts on also changing these sorts of
> instances. Is it realistic to expect chunk_bytes > UINT_MAX?

dm-stripe can be created with a stripe size that is more than 0xffffffff
bytes.

Though, the integer overflow already exists in the existing dm-stripe
target:

static void stripe_io_hints(struct dm_target *ti,
			    struct queue_limits *limits)
{
	struct stripe_c *sc = ti->private;
	unsigned int chunk_size = sc->chunk_size << SECTOR_SHIFT;

	limits->io_min = chunk_size;
	limits->io_opt = chunk_size * sc->stripes;
}

What should we set there as io_min and io_opt if
sc->chunk_size << SECTOR_SHIFT overflows? Should we set nothing?

Mikulas
On 03/07/2025 16:36, Mikulas Patocka wrote:
> dm-stripe can be created with a stripe size that is more than 0xffffffff
> bytes.
>
> Though, the integer overflow already exists in the existing dm-stripe
> target:
>
> What should we set there as io_min and io_opt if
> sc->chunk_size << SECTOR_SHIFT overflows? Should we set nothing?

For io_min/opt, maybe reduce to a factor of the stripe size / width (one
which fits in an unsigned int).

I am not sure if it is even sane to have such huge values in io_min, and
the bottom disk io_min should perhaps be used directly instead.

Martin Petersen might have a better idea.. he added those sysfs files :)

Thanks,
John
John,

> For io_min/opt, maybe reduce to a factor of the stripe size / width
> (one which fits in an unsigned int).
>
> I am not sure if it is even sane to have such huge values in io_min,
> and the bottom disk io_min should perhaps be used directly instead.

The intent for io_min was to convey the physical_block_size in the case
of an individual drive, and for it to be set to the stripe chunk size in
stacking scenarios that would otherwise involve read-modify-write (i.e.
RAID5 and RAID6).

io_opt was meant to communicate the stripe width. Reporting very large
values for io_opt is generally counterproductive since we can't write
multiple gigabytes in a single operation anyway.

logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors

-- 
Martin K. Petersen
On 09/07/2025 02:39, Martin K. Petersen wrote:
> logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors

Does pbs need to be a power-of-2? The block queue limits and splitting
code seems to rely on that, but it is not policed AFAICS.

The stacking code seems to just want it to be a multiple of lbs, which
itself must be a power-of-2.

Thanks,
John
John,

> Does pbs need to be a power-of-2?

For ATA and SCSI, definitely. NVMe could technically report a multiple
of the logical block size, but I've never come across anything that
didn't report a power of two.

-- 
Martin K. Petersen
On 09/07/2025 02:39, Martin K. Petersen wrote:
> The intent for io_min was to convey the physical_block_size in the case
> of an individual drive, and for it to be set to the stripe chunk size
> in stacking scenarios that would otherwise involve read-modify-write
> (i.e. RAID5 and RAID6).

And the same is done for md raid0/dm-stripe; I suppose that the idea is
to encourage larger-than-chunk-size writes to see the performance
advantage in striping.

> io_opt was meant to communicate the stripe width. Reporting very large
> values for io_opt is generally counterproductive since we can't write
> multiple gigabytes in a single operation anyway.

Right, and so it seems counterproductive to have a chunk size much
bigger than the bottom device io_opt.

> logical <= physical <= io_min <= io_opt <= max_sectors <= max_hw_sectors