This value in io_min is used to configure any atomic write limit for the
stacked device. The idea is that the atomic write unit max is a
power-of-2 factor of the stripe size, and the stripe size is available
in io_min.

Using io_min causes issues, as:
a. it may be mutated
b. the check for io_min being set for determining if we are dealing with
   a striped device is hard to get right, as reported in [0].

This series now sets chunk_sectors limit to share stripe size.

[0] https://lore.kernel.org/linux-block/888f3b1d-7817-4007-b3b3-1a2ea04df771@linux.ibm.com/T/#mecca17129f72811137d3c2f1e477634e77f06781

Based on 8b428f42f3edf nbd: fix lockdep deadlock warning

This series fixes issues for v6.16, but it's probably better to have this
in v6.17.

Differences to v5:
- Neaten code in blk_validate_atomic_write_limits() (Jens)

Differences to v4:
- Use check_shl_overflow() (Nilay)
- Use long long for chunk bytes in 2/6
- Add tags from Nilay (thanks!)

Differences to v3:
- relocate max_pow_of_two_factor() to common header and rework (Mikulas)
- cater for overflow from chunk sectors (Mikulas)

John Garry (6):
  ilog2: add max_pow_of_two_factor()
  block: sanitize chunk_sectors for atomic write limits
  md/raid0: set chunk_sectors limit
  md/raid10: set chunk_sectors limit
  dm-stripe: limit chunk_sectors to the stripe size
  block: use chunk_sectors when evaluating stacked atomic write limits

 block/blk-settings.c   | 61 ++++++++++++++++++++++++++----------------
 drivers/md/dm-stripe.c |  1 +
 drivers/md/raid0.c     |  1 +
 drivers/md/raid10.c    |  1 +
 fs/xfs/xfs_mount.c     |  5 ----
 include/linux/log2.h   | 14 ++++++++++
 6 files changed, 55 insertions(+), 28 deletions(-)

-- 
2.43.5
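The "power-of-2 factor of the stripe size" idea from the cover letter can be
sketched in userspace C as below. This is an illustration only, not the code
from the series: the first helper borrows the max_pow_of_two_factor() name
from the changelog but is written here from scratch, and
stacked_atomic_unit_max() with its parameters is a hypothetical name
introduced for this example.

```c
#include <stdint.h>

/*
 * Largest power of two that divides n, or 0 when n is 0.
 * Isolating the lowest set bit gives exactly that factor.
 */
static inline uint64_t max_pow_of_two_factor(uint64_t n)
{
	return n & (~n + 1);
}

/*
 * Hypothetical helper: limit a bottom device's atomic write unit max
 * (in bytes, assumed to be a power of two) so that it is also a factor
 * of the stripe size. A stripe size of 0 is taken to mean "not striped".
 */
static inline uint64_t stacked_atomic_unit_max(uint64_t unit_max_bytes,
					       uint64_t stripe_bytes)
{
	uint64_t factor;

	if (stripe_bytes == 0)
		return unit_max_bytes;

	factor = max_pow_of_two_factor(stripe_bytes);
	return unit_max_bytes < factor ? unit_max_bytes : factor;
}
```

For example, a 24 KiB stripe (3 * 8 KiB) has 8 KiB as its largest power-of-2
factor, so a 64 KiB unit max from the bottom device would be limited to 8 KiB
under these assumptions.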
On 7/11/25 5:09 PM, John Garry wrote:
> This value in io_min is used to configure any atomic write limit for the
> stacked device. The idea is that the atomic write unit max is a
> power-of-2 factor of the stripe size, and the stripe size is available
> in io_min.
>
> Using io_min causes issues, as:
> a. it may be mutated
> b. the check for io_min being set for determining if we are dealing with
>    a striped device is hard to get right, as reported in [0].
>
> This series now sets chunk_sectors limit to share stripe size.

Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
if we are dealing with a zoned block device that also supports atomic writes?

Not that I know of any such device, but better be safe, so maybe for now do not
enable atomic write support on zoned devices?

-- 
Damien Le Moal
Western Digital Research
On Fri, Jul 11, 2025 at 05:44:26PM +0900, Damien Le Moal wrote:
> On 7/11/25 5:09 PM, John Garry wrote:
> > This value in io_min is used to configure any atomic write limit for the
> > stacked device. The idea is that the atomic write unit max is a
> > power-of-2 factor of the stripe size, and the stripe size is available
> > in io_min.
> >
> > Using io_min causes issues, as:
> > a. it may be mutated
> > b. the check for io_min being set for determining if we are dealing with
> >    a striped device is hard to get right, as reported in [0].
> >
> > This series now sets chunk_sectors limit to share stripe size.
>
> Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
> if we are dealing with a zoned block device that also supports atomic writes?

Btw, I wonder if it's time to decouple the zone size from the chunk
size eventually. It seems like a nice little hack, but with things
like parity raid for zoned devices now showing up at least in academia,
and nvme devices reporting chunk sizes, the overload might not be that
good any more.

> Not that I know of any such device, but better be safe, so maybe for now
> do not enable atomic write support on zoned devices?

How would atomic writes make sense for zoned devices? Because all writes
up to the reported write pointer must be valid, the usual checks for
partial updates are lacking, so the only use would be to figure out if a
write got truncated. At least for file systems we detect this using the
fs metadata that must be written on I/O completion anyway, so the only
user would be an application with some sort of speculative writes that
can't detect partial writes. Which sounds rather fringe and dangerous.

Now we should be able to implement the software atomic writes pretty
easily for zoned XFS, and funnily they might actually be slightly faster
than normal writes due to the transaction batching. Now that we're
getting reasonable test coverage we should be able to give it a spin, but
I have a few too many things on my plate at the moment.
On 14/07/2025 06:53, Christoph Hellwig wrote:
> Now we should be able to implement the software atomic writes pretty
> easily for zoned XFS, and funnily they might actually be slightly faster
> than normal writes due to the transaction batching. Now that we're
> getting reasonable test coverage we should be able to give it a spin, but
> I have a few too many things on my plate at the moment.

Isn't reflink currently incompatible with zoned xfs?

I don't think that there is even anything else needed to automatically get
software-based atomics support for zoned xfs.
On Mon, Jul 14, 2025 at 08:52:39AM +0100, John Garry wrote:
> On 14/07/2025 06:53, Christoph Hellwig wrote:
>> Now we should be able to implement the software atomic writes pretty
>> easily for zoned XFS, and funnily they might actually be slightly faster
>> than normal writes due to the transaction batching. Now that we're
>> getting reasonable test coverage we should be able to give it a spin, but
>> I have a few too many things on my plate at the moment.
>
> Isn't reflink currently incompatible with zoned xfs?

reflink itself, yes, due to the garbage collection algorithm, which is not
reflink aware. But all I/O on zoned file RT device uses the same I/O
path design as writes that unshare reflinks, because it always has to
write out of place.
On 2025/07/14 14:53, Christoph Hellwig wrote:
> On Fri, Jul 11, 2025 at 05:44:26PM +0900, Damien Le Moal wrote:
>> On 7/11/25 5:09 PM, John Garry wrote:
>>> This value in io_min is used to configure any atomic write limit for the
>>> stacked device. The idea is that the atomic write unit max is a
>>> power-of-2 factor of the stripe size, and the stripe size is available
>>> in io_min.
>>>
>>> Using io_min causes issues, as:
>>> a. it may be mutated
>>> b. the check for io_min being set for determining if we are dealing with
>>>    a striped device is hard to get right, as reported in [0].
>>>
>>> This series now sets chunk_sectors limit to share stripe size.
>>
>> Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
>> if we are dealing with a zoned block device that also supports atomic writes?
>
> Btw, I wonder if it's time to decouple the zone size from the chunk
> size eventually. It seems like a nice little hack, but with things
> like parity raid for zoned devices now showing up at least in academia,
> and nvme devices reporting chunk sizes, the overload might not be that
> good any more.

Agreed, it would be nice to clean that up. BUT, the chunk_sectors sysfs
attribute file is reporting the zone size today. Changing that may break
applications. So I am not sure if we can actually do that, unless the sysfs
interface is considered as "unstable"?

>
>> Not that I know of any such device, but better be safe, so maybe for now
>> do not enable atomic write support on zoned devices?
>
> How would atomic writes make sense for zoned devices? Because all writes
> up to the reported write pointer must be valid, the usual checks for
> partial updates are lacking, so the only use would be to figure out if a
> write got truncated. At least for file systems we detect this using the
> fs metadata that must be written on I/O completion anyway, so the only
> user would be an application with some sort of speculative writes that
> can't detect partial writes. Which sounds rather fringe and dangerous.

The only thing I can think of which would make sense is to avoid torn writes
with SAS drives. But in itself, that is extremely niche.

>
> Now we should be able to implement the software atomic writes pretty
> easily for zoned XFS, and funnily they might actually be slightly faster
> than normal writes due to the transaction batching. Now that we're
> getting reasonable test coverage we should be able to give it a spin, but
> I have a few too many things on my plate at the moment.

-- 
Damien Le Moal
Western Digital Research
On Mon, Jul 14, 2025 at 03:00:57PM +0900, Damien Le Moal wrote:
> Agreed, it would be nice to clean that up. BUT, the chunk_sectors sysfs
> attribute file is reporting the zone size today. Changing that may break
> applications. So I am not sure if we can actually do that, unless the sysfs
> interface is considered as "unstable"?

Good point. I don't think it is considered unstable.
On 7/14/25 08:13, Christoph Hellwig wrote:
> On Mon, Jul 14, 2025 at 03:00:57PM +0900, Damien Le Moal wrote:
>> Agreed, it would be nice to clean that up. BUT, the chunk_sectors sysfs
>> attribute file is reporting the zone size today. Changing that may break
>> applications. So I am not sure if we can actually do that, unless the sysfs
>> interface is considered as "unstable"?
>
> Good point. I don't think it is considered unstable.
>
Hmm. It does, but really the meaning of 'chunk_sectors' (i.e. a boundary
which I/O requests may not cross) hasn't changed. And that's also
the original use-case for the mapping of zone size to chunk_sectors,
namely to ensure that the block layer generates valid I/O.
So from that standpoint I guess we can change it; in the end, there may
(and will) be setups where 'chunk_sectors' is smaller than the zone size.

We would need to have another attribute for the zone size, though :-)
But arguably we should have that even if we don't follow the above
reasoning.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
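The reading of 'chunk_sectors' as a boundary that I/O requests may not cross
can be illustrated with a small userspace sketch. The function names below
are hypothetical and are not the block layer's own helpers; the sketch also
assumes that a chunk_sectors value of 0 means "no boundary".

```c
#include <stdint.h>

/*
 * Number of sectors that can be issued starting at 'sector' before the
 * next chunk_sectors boundary is reached. Uses a plain modulo so that
 * non-power-of-2 chunk sizes also work.
 */
static inline uint64_t sectors_to_chunk_boundary(uint64_t sector,
						 uint32_t chunk_sectors)
{
	if (chunk_sectors == 0)
		return UINT64_MAX;	/* assumption: 0 means no boundary */

	return chunk_sectors - (sector % chunk_sectors);
}

/*
 * Length of the first piece of an I/O of nr_sectors starting at 'sector'
 * if it is split so that no piece crosses a chunk boundary.
 */
static inline uint64_t first_split_len(uint64_t sector, uint64_t nr_sectors,
				       uint32_t chunk_sectors)
{
	uint64_t left = sectors_to_chunk_boundary(sector, chunk_sectors);

	return nr_sectors < left ? nr_sectors : left;
}
```

With chunk_sectors set to the zone size, the same arithmetic keeps a request
inside a single zone; with it set to a stripe size, inside a single stripe
chunk.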
On 11/07/2025 09:44, Damien Le Moal wrote:
>> This series now sets chunk_sectors limit to share stripe size.
> Hmm... chunk_sectors for a zoned device is the zone size. So is this all safe
> if we are dealing with a zoned block device that also supports atomic writes?
> Not that I know of any such device, but better be safe, so maybe for now do not
> enable atomic write support on zoned devices?

I don't think that we need to do anything specific there.

Patch 1/6 catches if the chunk size is less than the atomic write max size.

Having said that, if a zoned device did support atomic writes then it
would be very odd to have its atomic write max size > zone size anyway.

Thanks,
John
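The kind of check described here (chunk size versus atomic write max size)
might look roughly like the userspace sketch below. It is not the code from
the series: atomic_max_fits_chunk() is a hypothetical name, the sketch only
reports whether the limits are consistent rather than deciding what to do
about it, and it widens to 64 bits before shifting to reflect the overflow
concern raised in the changelog.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Hypothetical predicate: true if an atomic write of atomic_max_bytes can
 * fit inside one chunk. chunk_sectors == 0 is treated as "no chunk limit".
 */
static inline bool atomic_max_fits_chunk(uint32_t atomic_max_bytes,
					 uint32_t chunk_sectors)
{
	uint64_t chunk_bytes;

	if (chunk_sectors == 0)
		return true;

	/* Widen first: a 32-bit shift could overflow for large chunks. */
	chunk_bytes = (uint64_t)chunk_sectors << SECTOR_SHIFT;

	return atomic_max_bytes <= chunk_bytes;
}
```

For a zoned device whose chunk_sectors is the zone size, this predicate would
only fail if the atomic write max size exceeded the zone size, which is the
odd case mentioned above.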