This patch series adds support for performing single block RWF_ATOMIC
writes for iomap XFS buffered IO. It builds upon the initial RFC shared
by John Garry last year [1]. Most of the details are present in the
respective commit messages, but I'd like to call out some of the design
points below:
1. The first 4 patches introduce the statx and iomap plumbing and the
page flags needed for basic atomic write support in buffered IO.
However, two key restrictions still apply:
FIRST: If the user buffer of an atomic write crosses a page boundary,
there's a possibility of a short write, for example if one user page
could not be faulted in or got reclaimed before the copy operation. For
now we don't allow such a scenario and require the user buffer to be
page aligned. This way either the full write goes through or nothing
does. This is also discussed in Matthew Wilcox's comment here [2].
This restriction is lifted in patch 5. The approach we took was to:
1. Pin the user pages.
2. Create a BVEC out of the pinned struct pages to pass to
copy_folio_from_iter_atomic() rather than the USER-backed iter. We
don't use the user iter directly because the pinned user page could
still get unmapped from the process, leading to short writes.
This approach lets us proceed only when we are sure we will not have a
short copy.
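
For illustration only (this is not the series' actual code), a sketch of
the pin-then-BVEC idea could look like the helper below. The function
name and calling convention are made up, but the APIs used
(pin_user_pages_fast(), bvec_set_page(), iov_iter_bvec()) are the
existing kernel ones:

#include <linux/mm.h>
#include <linux/bvec.h>
#include <linux/uio.h>

/*
 * Pin the user pages backing a single atomic write and build a
 * BVEC-backed iov_iter over them.  copy_folio_from_iter_atomic() fed
 * with this iter copies from pinned pages rather than the user mapping,
 * so it cannot hit a fault mid-copy and short-write.  The caller unpins
 * with unpin_user_pages() once the copy is done.
 */
static int pin_atomic_write_buf(unsigned long ubuf, size_t len,
				struct page **pages, struct bio_vec *bv,
				unsigned int nr_pages, struct iov_iter *iter)
{
	size_t off = offset_in_page(ubuf);
	size_t total = len;
	unsigned int i;
	long pinned;

	pinned = pin_user_pages_fast(ubuf, nr_pages, 0, pages);
	if (pinned != nr_pages) {
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		return -EFAULT;
	}

	for (i = 0; i < nr_pages; i++) {
		size_t seg = min_t(size_t, len, PAGE_SIZE - off);

		/* Each pinned page becomes one bvec segment. */
		bvec_set_page(&bv[i], pages[i], seg, off);
		len -= seg;
		off = 0;
	}

	iov_iter_bvec(iter, ITER_SOURCE, bv, nr_pages, total);
	return 0;
}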
SECOND: We only support block size == page size buffered atomic writes.
This is to avoid the following scenario:
1. A 4KB block atomic write marks the complete 64KB folio as
atomic.
2. Other writes dirty the whole 64KB folio.
3. Writeback sees the whole folio dirty and atomic and tries
to send a 64KB atomic write, which might exceed the
allowed atomic write size and fail.
Patch 7 adds support for sub-page atomic write tracking to remove this
restriction. We do this by adding 2 more bitmaps to ifs to track atomic
write start and end.
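
To picture the sub-block tracking, here is a simplified, hypothetical
sketch (the real patch extends struct iomap_folio_state and its packed
state[] bitmap rather than adding a new struct): two extra bitmaps
record where each atomic write starts and ends, so writeback can
reconstruct the exact user-issued ranges instead of treating the whole
dirty folio as one atomic IO.

#include <linux/bitops.h>
#include <linux/spinlock.h>

struct ifs_atomic_sketch {
	spinlock_t	lock;
	unsigned int	nr_blocks;	/* blocks per folio */
	unsigned long	*atomic_start;	/* bit N: atomic write starts at block N */
	unsigned long	*atomic_end;	/* bit N: atomic write ends at block N */
};

static void ifs_sketch_mark_atomic(struct ifs_atomic_sketch *ifs,
				   unsigned int first_blk,
				   unsigned int last_blk)
{
	spin_lock(&ifs->lock);
	__set_bit(first_blk, ifs->atomic_start);
	__set_bit(last_blk, ifs->atomic_end);
	spin_unlock(&ifs->lock);
}

/*
 * During writeback: given a block, find the end of the atomic range it
 * starts (if any), so the ioend can be cut at that boundary and
 * submitted with REQ_ATOMIC.  Returns nr_blocks if no end bit is set,
 * which a caller would treat as corruption.
 */
static long ifs_sketch_atomic_range_end(struct ifs_atomic_sketch *ifs,
					unsigned int start_blk)
{
	if (!test_bit(start_blk, ifs->atomic_start))
		return -1;	/* not the start of an atomic write */
	return find_next_bit(ifs->atomic_end, ifs->nr_blocks, start_blk);
}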
Lastly, a non-atomic write over an atomic write removes the atomic
guarantee. Userspace is expected to sync the data to disk after an
atomic write before performing any overwrites of that range.
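
As a usage sketch from the application side (assuming this series is
applied; today a buffered RWF_ATOMIC write is rejected with
-EOPNOTSUPP), a single-block atomic update followed by the required
sync could look like this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* from include/uapi/linux/fs.h */
#endif

int atomic_block_update(int fd, off_t offset, const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	/* len/offset must satisfy the FS atomic write unit constraints. */
	ret = pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
	if (ret != (ssize_t)len) {
		perror("pwritev2(RWF_ATOMIC)");
		return -1;
	}

	/* Persist the atomic write before the range is rewritten. */
	if (fdatasync(fd) != 0) {
		perror("fdatasync");
		return -1;
	}
	return 0;
}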
This series has survived an xfstests -g quick run and I'll continue to
test it. I wanted to put out the RFC early to get reviews on the design
and suggestions on any better approaches.
[1] https://lore.kernel.org/all/20240422143923.3927601-1-john.g.garry@oracle.com/
[2] https://lore.kernel.org/all/ZiZ8XGZz46D3PRKr@casper.infradead.org/
Thanks,
Ojaswin
John Garry (2):
fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
mm: Add PG_atomic
Ojaswin Mujoo (6):
fs: Add initial buffered atomic write support info to statx
iomap: buffered atomic write support
iomap: pin pages for RWF_ATOMIC buffered write
xfs: Report atomic write min and max for buf io as well
iomap: Add bs<ps buffered atomic writes support
xfs: Lift the bs == ps restriction for HW buffered atomic writes
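
For reference, this is roughly how an application queries the current
(direct IO) atomic write limits via statx(); patches 1, 3 and 6
rename and extend this reporting to distinguish buffered IO, and since
the new field/flag names are defined by the series, only the existing
upstream names are shown here. Requires uapi headers that define
STATX_WRITE_ATOMIC (6.11+):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int print_atomic_write_limits(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) != 0) {
		perror("statx");
		return -1;
	}

	if (!(stx.stx_attributes_mask & STATX_ATTR_WRITE_ATOMIC)) {
		printf("%s: atomic writes not reported\n", path);
		return 0;
	}

	printf("%s: atomic write unit min %u max %u, max segments %u\n",
	       path, stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max,
	       stx.stx_atomic_write_segments_max);
	return 0;
}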
.../filesystems/ext4/atomic_writes.rst | 4 +-
block/bdev.c | 7 +-
fs/ext4/inode.c | 9 +-
fs/iomap/buffered-io.c | 395 ++++++++++++++++--
fs/iomap/ioend.c | 21 +-
fs/iomap/trace.h | 12 +-
fs/read_write.c | 3 -
fs/stat.c | 33 +-
fs/xfs/xfs_file.c | 9 +-
fs/xfs/xfs_iops.c | 127 +++---
fs/xfs/xfs_iops.h | 6 +-
include/linux/fs.h | 3 +-
include/linux/iomap.h | 3 +
include/linux/page-flags.h | 5 +
include/trace/events/mmflags.h | 3 +-
include/trace/misc/fs.h | 3 +-
include/uapi/linux/stat.h | 10 +-
tools/include/uapi/linux/stat.h | 10 +-
.../trace/beauty/include/uapi/linux/stat.h | 10 +-
19 files changed, 551 insertions(+), 122 deletions(-)
--
2.51.0
On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote:
> This patch adds support to perform single block RWF_ATOMIC writes for
> iomap xfs buffered IO. This builds upon the inital RFC shared by John
> Garry last year [1]. Most of the details are present in the respective
> commit messages but I'd mention some of the design points below:

What is the use case for this functionality? i.e. what is the
reason for adding all this complexity?

-Dave.
--
Dave Chinner
david@fromorbit.com
On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote:
> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote:
> > This patch adds support to perform single block RWF_ATOMIC writes for
> > iomap xfs buffered IO. This builds upon the inital RFC shared by John
> > Garry last year [1]. Most of the details are present in the respective
> > commit messages but I'd mention some of the design points below:
>
> What is the use case for this functionality? i.e. what is the
> reason for adding all this complexity?

Seconded. The atomic code has a lot of complexity, and further mixing
it with buffered I/O makes this even worse. We'd need a really important
use case to even consider it.
Christoph Hellwig <hch@lst.de> writes:

> On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote:
>> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote:
>> > This patch adds support to perform single block RWF_ATOMIC writes for
>> > iomap xfs buffered IO. This builds upon the inital RFC shared by John
>> > Garry last year [1]. Most of the details are present in the respective
>> > commit messages but I'd mention some of the design points below:
>>
>> What is the use case for this functionality? i.e. what is the
>> reason for adding all this complexity?
>
> Seconded. The atomic code has a lot of complexity, and further mixing
> it with buffered I/O makes this even worse. We'd need a really important
> use case to even consider it.

I agree this should have been in the cover letter itself.

I believe the reason for adding this functionality was also discussed at
LSFMM too...

For e.g. https://lwn.net/Articles/974578/ goes in depth and talks about
Postgres folks looking for this, since PostgreSQL databases uses
buffered I/O for their database writes.

-ritesh
On Thu, Nov 13, 2025 at 11:12:49AM +0530, Ritesh Harjani wrote: > Christoph Hellwig <hch@lst.de> writes: > > > On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote: > >> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote: > >> > This patch adds support to perform single block RWF_ATOMIC writes for > >> > iomap xfs buffered IO. This builds upon the inital RFC shared by John > >> > Garry last year [1]. Most of the details are present in the respective > >> > commit messages but I'd mention some of the design points below: > >> > >> What is the use case for this functionality? i.e. what is the > >> reason for adding all this complexity? > > > > Seconded. The atomic code has a lot of complexity, and further mixing > > it with buffered I/O makes this even worse. We'd need a really important > > use case to even consider it. > > I agree this should have been in the cover letter itself. > > I believe the reason for adding this functionality was also discussed at > LSFMM too... > > For e.g. https://lwn.net/Articles/974578/ goes in depth and talks about > Postgres folks looking for this, since PostgreSQL databases uses > buffered I/O for their database writes. Pointing at a discussion about how "this application has some ideas on how it can maybe use it someday in the future" isn't a particularly good justification. This still sounds more like a research project than something a production system needs right now. Why didn't you use the existing COW buffered write IO path to implement atomic semantics for buffered writes? The XFS functionality is already all there, and it doesn't require any changes to the page cache or iomap to support... -Dave. -- Dave Chinner david@fromorbit.com
On Thu, Nov 13, 2025 at 09:32:11PM +1100, Dave Chinner wrote: > On Thu, Nov 13, 2025 at 11:12:49AM +0530, Ritesh Harjani wrote: > > Christoph Hellwig <hch@lst.de> writes: > > > > > On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote: > > >> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote: > > >> > This patch adds support to perform single block RWF_ATOMIC writes for > > >> > iomap xfs buffered IO. This builds upon the inital RFC shared by John > > >> > Garry last year [1]. Most of the details are present in the respective > > >> > commit messages but I'd mention some of the design points below: > > >> > > >> What is the use case for this functionality? i.e. what is the > > >> reason for adding all this complexity? > > > > > > Seconded. The atomic code has a lot of complexity, and further mixing > > > it with buffered I/O makes this even worse. We'd need a really important > > > use case to even consider it. > > > > I agree this should have been in the cover letter itself. > > > > I believe the reason for adding this functionality was also discussed at > > LSFMM too... > > > > For e.g. https://lwn.net/Articles/974578/ goes in depth and talks about > > Postgres folks looking for this, since PostgreSQL databases uses > > buffered I/O for their database writes. > > Pointing at a discussion about how "this application has some ideas > on how it can maybe use it someday in the future" isn't a > particularly good justification. This still sounds more like a > research project than something a production system needs right now. Hi Dave, Christoph, There were some discussions around use cases for buffered atomic writes in the previous LSFMM covered by LWN here [1]. AFAIK, there are databases that recommend/prefer buffered IO over direct IO. As mentioned in the article, MongoDB being one that supports both but recommends buffered IO. Further, many DBs support both direct IO and buffered IO well and it may not be fair to force them to stick to direct IO to get the benefits of atomic writes. [1] https://lwn.net/Articles/1016015/ > > Why didn't you use the existing COW buffered write IO path to > implement atomic semantics for buffered writes? The XFS > functionality is already all there, and it doesn't require any > changes to the page cache or iomap to support... This patch set focuses on HW accelerated single block atomic writes with buffered IO, to get some early reviews on the core design. Just like we did for direct IO atomic writes, the software fallback with COW and multi block support can be added eventually. Regards, ojaswin > > -Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Nov 14, 2025 at 02:50:25PM +0530, Ojaswin Mujoo wrote: > On Thu, Nov 13, 2025 at 09:32:11PM +1100, Dave Chinner wrote: > > On Thu, Nov 13, 2025 at 11:12:49AM +0530, Ritesh Harjani wrote: > > > Christoph Hellwig <hch@lst.de> writes: > > > > > > > On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote: > > > >> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote: > > > >> > This patch adds support to perform single block RWF_ATOMIC writes for > > > >> > iomap xfs buffered IO. This builds upon the inital RFC shared by John > > > >> > Garry last year [1]. Most of the details are present in the respective > > > >> > commit messages but I'd mention some of the design points below: > > > >> > > > >> What is the use case for this functionality? i.e. what is the > > > >> reason for adding all this complexity? > > > > > > > > Seconded. The atomic code has a lot of complexity, and further mixing > > > > it with buffered I/O makes this even worse. We'd need a really important > > > > use case to even consider it. > > > > > > I agree this should have been in the cover letter itself. > > > > > > I believe the reason for adding this functionality was also discussed at > > > LSFMM too... > > > > > > For e.g. https://lwn.net/Articles/974578/ goes in depth and talks about > > > Postgres folks looking for this, since PostgreSQL databases uses > > > buffered I/O for their database writes. > > > > Pointing at a discussion about how "this application has some ideas > > on how it can maybe use it someday in the future" isn't a > > particularly good justification. This still sounds more like a > > research project than something a production system needs right now. > > Hi Dave, Christoph, > > There were some discussions around use cases for buffered atomic writes > in the previous LSFMM covered by LWN here [1]. AFAIK, there are > databases that recommend/prefer buffered IO over direct IO. As mentioned > in the article, MongoDB being one that supports both but recommends > buffered IO. Further, many DBs support both direct IO and buffered IO > well and it may not be fair to force them to stick to direct IO to get > the benefits of atomic writes. > > [1] https://lwn.net/Articles/1016015/ You are quoting a discussion about atomic writes that was held without any XFS developers present. Given how XFS has driven atomic write functionality so far, XFS developers might have some ..... opinions about how buffered atomic writes in XFS... Indeed, go back to the 2024 buffered atomic IO LSFMM discussion, where there were XFS developers present. That's the discussion that Ritesh referenced, so you should be aware of it. https://lwn.net/Articles/974578/ Back then I talked about how atomic writes made no sense as -writeback IO- given the massive window for anything else to modify the data in the page cache. There is no guarantee that what the application wrote in the syscall is what gets written to disk with writeback IO. i.e. anything that can access the page cache can "tear" application data that is staged as "atomic data" for later writeback. IOWs, the concept of atomic writes for writeback IO makes almost no sense at all - dirty data at rest in the page cache is not protected against 3rd party access or modification. The "atomic data IO" semantics can only exist in the submitting IO context where exclusive access to the user data can be guaranteed. IMO, the only way semantics that makes sense for buffered atomic writes through the page cache is write-through IO. 
The "atomic" context is related directly to user data provided at IO submission, and so IO submitted must guarantee exactly that data is being written to disk in that IO. IOWs, we have to guarantee exclusive access between the data copy-in and the pages being marked for writeback. The mapping needs to be marked as using stable pages to prevent anyone else changing the cached data whilst it has an atomic IO pending on it. That means folios covering atomic IO ranges do not sit in the page cache in a dirty state - they *must* immediately transition to the writeback state before the folio is unlocked so that *nothing else can modify them* before the physical REQ_ATOMIC IO is submitted and completed. If we've got the folios marked as writeback, we can pack them immediately into a bio and submit the IO (e.g. via the iomap DIO code). There is no need to involve the buffered IO writeback path here; we've already got the folios at hand and in the right state for IO. Once the IO is done, we end writeback on them and they remain clean in the page caceh for anyone else to access and modify... This gives us the same physical IO semantics for buffered and direct atomic IO, and it allows the same software fallbacks for larger IO to be used as well. > > Why didn't you use the existing COW buffered write IO path to > > implement atomic semantics for buffered writes? The XFS > > functionality is already all there, and it doesn't require any > > changes to the page cache or iomap to support... > > This patch set focuses on HW accelerated single block atomic writes with > buffered IO, to get some early reviews on the core design. What hardware acceleration? Hardware atomic writes are do not make IO faster; they only change IO failure semantics in certain corner cases. Making buffered writeback IO use REQ_ATOMIC does not change the failure semantics of buffered writeback from the point of view of an application; the applicaiton still has no idea just how much data or what files lost data whent eh system crashes. Further, writeback does not retain application write ordering, so the application also has no control over the order that structured data is updated on physical media. Hence if the application needs specific IO ordering for crash recovery (e.g. to avoid using a WAL) it cannot use background buffered writeback for atomic writes because that does not guarantee ordering. What happens when you do two atomic buffered writes to the same file range? The second on hits the page cache, so now the crash recovery semantic is no longer "old or new", it's "some random older version or new". If the application rewrites a range frequently enough, on-disk updates could skip dozens of versions between "old" and "new", whilst other ranges of the file move one version at a time. The application has -zero control- of this behaviour because it is background writeback that determines when something gets written to disk, not the application. IOWs, the only way to guarantee single version "old or new" atomic buffered overwrites for any given write would be to force flushing of the data post-write() completion. That means either O_DSYNC, fdatasync() or sync_file_range(). And this turns the atomic writes into -write-through- IO, not write back IO... > Just like we did for direct IO atomic writes, the software fallback with > COW and multi block support can be added eventually. 
If the reason for this functionality is "maybe someone can use it in future", then you're not implementing this functionality to optimise an existing workload. It's a research project looking for a user. Work with the database engineers to build a buffered atomic write based engine that implements atomic writes with RWF_DSYNC. Make it work, and optimise it to be competitive with existing database engines, than then show how much faster it is using RWF_ATOMIC buffered writes. Alternatively - write an algorithm that assumes the filesystem is using COW for overwrites, and optimise the data integrity algorithm based on this knowledge. e.g. use always-cow mode on XFS, or just optimise for normal bcachefs or btrfs buffered writes. Use O_DSYNC when completion to submission ordering is required. Now you have an application algorithm that is optimised for old-or-new behaviour, and that can then be acclerated on overwrite-in-place capable filesystems by using a direct-to-hw REQ_ATOMIC overwrite to provide old-or-new semantics instead of using COW. Yes, there are corner cases - partial writeback, fragmented files, etc - where data will a mix of old and new when using COW without RWF_DSYNC. Those are the the cases that RWF_ATOMIC needs to mitigate, but we don't need whacky page cache and writeback stuff to implement RWF_ATOMIC semantics in COW capable filesystems. i.e. enhance the applicaitons to take advantage of native COW old-or-new data semantics for buffered writes, then we can look at direct-to-hw fast paths to optimise those algorithms. Trying to go direct-to-hw first without having any clue of how applications are going to use such functionality is backwards. Design the applicaiton level code that needs highly performant old-or-new buffered write guarantees, then we can optimise the data paths for it... -Dave. -- Dave Chinner david@fromorbit.com
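A minimal sketch of the write-through flow Dave describes above, for
illustration only (hypothetical helper, not proposed code; error
handling, uptodate/ifs state and multi-folio writes are omitted): the
folio moves from locked straight to writeback before being unlocked,
the REQ_ATOMIC bio is issued synchronously, and the folio ends up clean
in the page cache afterwards.

#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static int atomic_writethrough_folio(struct folio *folio,
				     struct block_device *bdev,
				     sector_t sector, size_t len,
				     size_t offset, struct iov_iter *from)
{
	struct bio_vec bvec;
	struct bio bio;
	size_t copied;
	int ret;

	/* Folio is locked; nothing else can dirty it while we copy in. */
	copied = copy_folio_from_iter_atomic(folio, offset, len, from);
	if (copied != len)
		return -EFAULT;		/* no partial atomic writes */

	/*
	 * Transition to writeback before unlocking so that no third party
	 * can modify the data between copy-in and the REQ_ATOMIC
	 * submission.
	 */
	folio_start_writeback(folio);
	folio_unlock(folio);

	bio_init(&bio, bdev, &bvec, 1, REQ_OP_WRITE | REQ_ATOMIC | REQ_SYNC);
	bio.bi_iter.bi_sector = sector;
	bio_add_folio_nofail(&bio, folio, len, offset);

	ret = submit_bio_wait(&bio);

	/* The folio stays clean in the page cache after IO completes. */
	folio_end_writeback(folio);
	return ret;
}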
On Sun, Nov 16, 2025 at 07:11:50PM +1100, Dave Chinner wrote: > On Fri, Nov 14, 2025 at 02:50:25PM +0530, Ojaswin Mujoo wrote: > > On Thu, Nov 13, 2025 at 09:32:11PM +1100, Dave Chinner wrote: > > > On Thu, Nov 13, 2025 at 11:12:49AM +0530, Ritesh Harjani wrote: > > > > Christoph Hellwig <hch@lst.de> writes: > > > > > > > > > On Thu, Nov 13, 2025 at 08:56:56AM +1100, Dave Chinner wrote: > > > > >> On Wed, Nov 12, 2025 at 04:36:03PM +0530, Ojaswin Mujoo wrote: > > > > >> > This patch adds support to perform single block RWF_ATOMIC writes for > > > > >> > iomap xfs buffered IO. This builds upon the inital RFC shared by John > > > > >> > Garry last year [1]. Most of the details are present in the respective > > > > >> > commit messages but I'd mention some of the design points below: > > > > >> > > > > >> What is the use case for this functionality? i.e. what is the > > > > >> reason for adding all this complexity? > > > > > > > > > > Seconded. The atomic code has a lot of complexity, and further mixing > > > > > it with buffered I/O makes this even worse. We'd need a really important > > > > > use case to even consider it. > > > > > > > > I agree this should have been in the cover letter itself. > > > > > > > > I believe the reason for adding this functionality was also discussed at > > > > LSFMM too... > > > > > > > > For e.g. https://lwn.net/Articles/974578/ goes in depth and talks about > > > > Postgres folks looking for this, since PostgreSQL databases uses > > > > buffered I/O for their database writes. > > > > > > Pointing at a discussion about how "this application has some ideas > > > on how it can maybe use it someday in the future" isn't a > > > particularly good justification. This still sounds more like a > > > research project than something a production system needs right now. > > > > Hi Dave, Christoph, > > > > There were some discussions around use cases for buffered atomic writes > > in the previous LSFMM covered by LWN here [1]. AFAIK, there are > > databases that recommend/prefer buffered IO over direct IO. As mentioned > > in the article, MongoDB being one that supports both but recommends > > buffered IO. Further, many DBs support both direct IO and buffered IO > > well and it may not be fair to force them to stick to direct IO to get > > the benefits of atomic writes. > > > > [1] https://lwn.net/Articles/1016015/ > > You are quoting a discussion about atomic writes that was > held without any XFS developers present. Given how XFS has driven > atomic write functionality so far, XFS developers might have some > ..... opinions about how buffered atomic writes in XFS... > > Indeed, go back to the 2024 buffered atomic IO LSFMM discussion, > where there were XFS developers present. That's the discussion that > Ritesh referenced, so you should be aware of it. > > https://lwn.net/Articles/974578/ > > Back then I talked about how atomic writes made no sense as > -writeback IO- given the massive window for anything else to modify > the data in the page cache. There is no guarantee that what the > application wrote in the syscall is what gets written to disk with > writeback IO. i.e. anything that can access the page cache can > "tear" application data that is staged as "atomic data" for later > writeback. > > IOWs, the concept of atomic writes for writeback IO makes almost no > sense at all - dirty data at rest in the page cache is not protected > against 3rd party access or modification. 
The "atomic data IO" > semantics can only exist in the submitting IO context where > exclusive access to the user data can be guaranteed. > > IMO, the only way semantics that makes sense for buffered atomic > writes through the page cache is write-through IO. The "atomic" > context is related directly to user data provided at IO submission, > and so IO submitted must guarantee exactly that data is being > written to disk in that IO. > > IOWs, we have to guarantee exclusive access between the data copy-in > and the pages being marked for writeback. The mapping needs to be > marked as using stable pages to prevent anyone else changing the > cached data whilst it has an atomic IO pending on it. > > That means folios covering atomic IO ranges do not sit in the page > cache in a dirty state - they *must* immediately transition to the > writeback state before the folio is unlocked so that *nothing else > can modify them* before the physical REQ_ATOMIC IO is submitted and > completed. > > If we've got the folios marked as writeback, we can pack them > immediately into a bio and submit the IO (e.g. via the iomap DIO > code). There is no need to involve the buffered IO writeback path > here; we've already got the folios at hand and in the right state > for IO. Once the IO is done, we end writeback on them and they > remain clean in the page caceh for anyone else to access and > modify... Hi Dave, I believe the essenece of your comment is that the data in the page cache can be modified between the write and the writeback time and hence it makes sense to have a write-through only semantic for RWF_ATOMIC buffered IO. However, as per various discussions around this on the mailing list, it is my understanding that protecting tearing against an application changing a data range that was previously written atomically is something that falls out of scope of RWF_ATOMIC. As John pointed out in [1], even with dio, RWF_ATOMIC writes can be torn if the application does parallel overlaps. The only thing we guarantee is the data doesn't tear when the actualy IO happens, and from there its the userspace's responsibility to not change the data till IO [2]. I believe userspace changing data between write and writeback time falls in the same category. [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ [2] https://lore.kernel.org/fstests/20250729144526.GB2672049@frogsfrogsfrogs/ > > This gives us the same physical IO semantics for buffered and direct > atomic IO, and it allows the same software fallbacks for larger IO > to be used as well. > > > > Why didn't you use the existing COW buffered write IO path to > > > implement atomic semantics for buffered writes? The XFS > > > functionality is already all there, and it doesn't require any > > > changes to the page cache or iomap to support... > > > > This patch set focuses on HW accelerated single block atomic writes with > > buffered IO, to get some early reviews on the core design. > > What hardware acceleration? Hardware atomic writes are do not make > IO faster; they only change IO failure semantics in certain corner > cases. Making buffered writeback IO use REQ_ATOMIC does not change > the failure semantics of buffered writeback from the point of view > of an application; the applicaiton still has no idea just how much > data or what files lost data whent eh system crashes. 
> > Further, writeback does not retain application write ordering, so > the application also has no control over the order that structured > data is updated on physical media. Hence if the application needs > specific IO ordering for crash recovery (e.g. to avoid using a WAL) > it cannot use background buffered writeback for atomic writes > because that does not guarantee ordering. > > What happens when you do two atomic buffered writes to the same file > range? The second on hits the page cache, so now the crash recovery > semantic is no longer "old or new", it's "some random older version > or new". If the application rewrites a range frequently enough, > on-disk updates could skip dozens of versions between "old" and > "new", whilst other ranges of the file move one version at a time. > The application has -zero control- of this behaviour because it is > background writeback that determines when something gets written to > disk, not the application. > > IOWs, the only way to guarantee single version "old or new" atomic > buffered overwrites for any given write would be to force flushing > of the data post-write() completion. That means either O_DSYNC, > fdatasync() or sync_file_range(). And this turns the atomic writes > into -write-through- IO, not write back IO... I agree that there is no ordeirng guarantee without calls to sync and friends, but as with all other IO paths, it has always been the applicatoin that needs to enforce the ordering. Applications like DBs are well aware of this however there are still areas where they can benefit with unordered atomic IO, eg bg write of a bunch of dirty buffers, which only need to be sync'd once during checkpoint. > > > Just like we did for direct IO atomic writes, the software fallback with > > COW and multi block support can be added eventually. > > If the reason for this functionality is "maybe someone > can use it in future", then you're not implementing this > functionality to optimise an existing workload. It's a research > project looking for a user. > > Work with the database engineers to build a buffered atomic write > based engine that implements atomic writes with RWF_DSYNC. > Make it work, and optimise it to be competitive with existing > database engines, than then show how much faster it is using > RWF_ATOMIC buffered writes. > > Alternatively - write an algorithm that assumes the filesystem is > using COW for overwrites, and optimise the data integrity algorithm > based on this knowledge. e.g. use always-cow mode on XFS, or just > optimise for normal bcachefs or btrfs buffered writes. Use O_DSYNC > when completion to submission ordering is required. Now you have > an application algorithm that is optimised for old-or-new behaviour, > and that can then be acclerated on overwrite-in-place capable > filesystems by using a direct-to-hw REQ_ATOMIC overwrite to provide > old-or-new semantics instead of using COW. > > Yes, there are corner cases - partial writeback, fragmented files, > etc - where data will a mix of old and new when using COW without > RWF_DSYNC. Those are the the cases that RWF_ATOMIC needs to > mitigate, but we don't need whacky page cache and writeback stuff to > implement RWF_ATOMIC semantics in COW capable filesystems. > > i.e. enhance the applicaitons to take advantage of native COW > old-or-new data semantics for buffered writes, then we can look at > direct-to-hw fast paths to optimise those algorithms. 
> > Trying to go direct-to-hw first without having any clue of how > applications are going to use such functionality is backwards. > Design the applicaiton level code that needs highly performant > old-or-new buffered write guarantees, then we can optimise the data > paths for it... Got it, thanks for the pointers Dave, we will look into this. Regards, ojaswin > > -Dave. > -- > Dave Chinner > david@fromorbit.com
On 16/11/2025 08:11, Dave Chinner wrote:
>> This patch set focuses on HW accelerated single block atomic writes with
>> buffered IO, to get some early reviews on the core design.
> What hardware acceleration? Hardware atomic writes are do not make
> IO faster; they only change IO failure semantics in certain corner
> cases.

I think that he references using REQ_ATOMIC-based bio vs xfs software-based
atomic writes (which reuse the CoW infrastructure). And the former is
considerably faster from my testing (for DIO, obvs). But the latter has not
been optimized.
On Mon, Nov 17, 2025 at 10:59:55AM +0000, John Garry wrote: > On 16/11/2025 08:11, Dave Chinner wrote: > > > This patch set focuses on HW accelerated single block atomic writes with > > > buffered IO, to get some early reviews on the core design. > > What hardware acceleration? Hardware atomic writes are do not make > > IO faster; they only change IO failure semantics in certain corner > > cases. > > I think that he references using REQ_ATOMIC-based bio vs xfs software-based > atomic writes (which reuse the CoW infrastructure). And the former is > considerably faster from my testing (for DIO, obvs). But the latter has not > been optimized. For DIO, REQ_ATOMIC IO will generally be faster than the software fallback because no page cache interactions or data copy is required by the DIO REQ_ATOMIC fast path. But we are considering buffered writes, which *must* do a data copy, and so the behaviour and performance differential of doing a COW vs trying to force writeback to do REQ_ATOMIC IO is going to be much different. Consider that the way atomic buffered writes have been implemented in writeback - turning off all folio and IO merging. This means writeback efficiency of atomic writes is going to be horrendous compared to COW writes that don't use REQ_ATOMIC. Further, REQ_ATOMIC buffered writes need to turn off delayed allocation because if you can't allocate aligned extents then the atomic write can *never* be performed. Hence we have to allocate up front where we can return errors to userspace immediately, rather than just reserve space and punt allocation to writeback. i.e. we have to avoid the situation where we have dirty "atomic" data in the page cache that cannot be written because physical allocation fails. The likely outcome of turning off delalloc is that it further degrades buffered atomic write writeback efficiency because it removes the ability for the filesystem to optimise physical locality of writeback IO. e.g. adjacent allocation across multiple small files or packing of random writes in a single file to allow them to merge at the block layer into one big IO... REQ_ATOMIC is a natural fit for DIO because DIO is largely a "one write syscall, one physical IO" style interface. Buffered writes, OTOH, completely decouples application IO from physical IO, and so there is no real "atomic" connection between the data being written into the page caceh and the physical IO that is performed at some time later. This decoupling of physical IO is what brings all the problems and inefficiencies. The filesystem being able to mark the RWF_ATOMIC write range as a COW range at submission time creates a natural "atomic IO" behaviour without requiring the page cache or writeback to even care that the data needs to be written atomically. From there, we optimise the COW IO path to record that the new COW extent was created for the purpose of an atomic write. Then when we go to write back data over that extent, the filesystem can chose to do a REQ_ATOMIC write to do an atomic overwrite instead of allocating a new extent and swapping the BMBT extent pointers at IO completion time. We really don't care if 4x16kB adjacent RWF_ATOMIC writes are submitted as 1x64kB REQ_ATOMIC IO or 4 individual 16kB REQ_ATOMIC IOs. The former is much more efficient from an IO perspective, and the COW path can actually optimise for this because it can track the atomic write ranges in cache exactly. 
If the range is larger (or unaligned) than what REQ_ATOMIC can handle, we use COW writeback to optimise for maximum writeback bandwidth, otherwise we use REQ_ATOMIC to optimise for minimum writeback submission and completion overhead... IOWs, I think that for XFS (and other COW-capable filesystems) we should be looking at optimising the COW IO path to use REQ_ATOMIC where appropriate to create a direct overwrite fast path for RWF_ATOMIC buffered writes. This seems a more natural and a lot less intrusive than trying to blast through the page caceh abstractions to directly couple userspace IO boundaries to physical writeback IO boundaries... -Dave. -- Dave Chinner david@fromorbit.com
On Tue, Nov 18, 2025 at 07:51:27AM +1100, Dave Chinner wrote: > On Mon, Nov 17, 2025 at 10:59:55AM +0000, John Garry wrote: > > On 16/11/2025 08:11, Dave Chinner wrote: > > > > This patch set focuses on HW accelerated single block atomic writes with > > > > buffered IO, to get some early reviews on the core design. > > > What hardware acceleration? Hardware atomic writes are do not make > > > IO faster; they only change IO failure semantics in certain corner > > > cases. > > > > I think that he references using REQ_ATOMIC-based bio vs xfs software-based > > atomic writes (which reuse the CoW infrastructure). And the former is > > considerably faster from my testing (for DIO, obvs). But the latter has not > > been optimized. > Hi Dave, Thanks for the review and insights. Going through the discussions in previous emails and this email, I understand that there are 2 main points/approaches that you've mentioned: 1. Using COW extents to track atomic ranges - Discussed inline below. 2. Using write-through for RWF_ATOMIC buffered-IO (Suggested in [1]) - [1] https://lore.kernel.org/linux-ext4/aRmHRk7FGD4nCT0s@dread.disaster.area/ - I will respond inline in the above thread. > For DIO, REQ_ATOMIC IO will generally be faster than the software > fallback because no page cache interactions or data copy is required > by the DIO REQ_ATOMIC fast path. > > But we are considering buffered writes, which *must* do a data copy, > and so the behaviour and performance differential of doing a COW vs > trying to force writeback to do REQ_ATOMIC IO is going to be much > different. > > Consider that the way atomic buffered writes have been implemented > in writeback - turning off all folio and IO merging. This means > writeback efficiency of atomic writes is going to be horrendous > compared to COW writes that don't use REQ_ATOMIC. Yes, I agree that it is a bit of an overkill. > > Further, REQ_ATOMIC buffered writes need to turn off delayed > allocation because if you can't allocate aligned extents then the > atomic write can *never* be performed. Hence we have to allocate up > front where we can return errors to userspace immediately, rather > than just reserve space and punt allocation to writeback. i.e. we > have to avoid the situation where we have dirty "atomic" data in the > page cache that cannot be written because physical allocation fails. > > The likely outcome of turning off delalloc is that it further > degrades buffered atomic write writeback efficiency because it > removes the ability for the filesystem to optimise physical locality > of writeback IO. e.g. adjacent allocation across multiple small > files or packing of random writes in a single file to allow them to > merge at the block layer into one big IO... > > REQ_ATOMIC is a natural fit for DIO because DIO is largely a "one > write syscall, one physical IO" style interface. Buffered writes, > OTOH, completely decouples application IO from physical IO, and so > there is no real "atomic" connection between the data being written > into the page caceh and the physical IO that is performed at some > time later. > > This decoupling of physical IO is what brings all the problems and > inefficiencies. The filesystem being able to mark the RWF_ATOMIC > write range as a COW range at submission time creates a natural > "atomic IO" behaviour without requiring the page cache or writeback > to even care that the data needs to be written atomically. 
> > From there, we optimise the COW IO path to record that > the new COW extent was created for the purpose of an atomic write. > Then when we go to write back data over that extent, the filesystem > can chose to do a REQ_ATOMIC write to do an atomic overwrite instead > of allocating a new extent and swapping the BMBT extent pointers at > IO completion time. > > We really don't care if 4x16kB adjacent RWF_ATOMIC writes are > submitted as 1x64kB REQ_ATOMIC IO or 4 individual 16kB REQ_ATOMIC > IOs. The former is much more efficient from an IO perspective, and > the COW path can actually optimise for this because it can track the > atomic write ranges in cache exactly. If the range is larger (or > unaligned) than what REQ_ATOMIC can handle, we use COW writeback to > optimise for maximum writeback bandwidth, otherwise we use > REQ_ATOMIC to optimise for minimum writeback submission and > completion overhead... Okay IIUC, you are suggesting that, instead of tracking the atomic ranges in page cache and ifs, lets move that to the filesystem, for example in XFS we can: 1. In write iomap_begin path, for RWF_ATOMIC, create a COW extent and mark it as atomic. 2. Carry on with the memcpy to folio and finish the write path. 3. During writeback, at XFS can detect that there is a COW atomic extent. It can then: 3.1 See that it is an overlap that can be done with REQ_ATOMIC directly 3.2 Else, finish the atomic IO in software emulated way just like we do for direct IO currently. I believe the above example with XFS can also be extended to a FS like ext4 without needing COW range, as long as we can ensure that we always meet the conditions for REQ_ATOMIC during writeback (example by using bigalloc for aligned extents and being careful not to cross the atomic write limits) > > IOWs, I think that for XFS (and other COW-capable filesystems) we > should be looking at optimising the COW IO path to use REQ_ATOMIC > where appropriate to create a direct overwrite fast path for > RWF_ATOMIC buffered writes. This seems a more natural and a lot less > intrusive than trying to blast through the page caceh abstractions > to directly couple userspace IO boundaries to physical writeback IO > boundaries... I agree that this approach avoids bloating the page cache and ifs layers with RWF_ATOMIC implementation details. That being said, the task of managing the atomic ranges is now pushed down to the FS and is no longer generic which might introduce friction in onboarding of new FSes in the future. Regardless, from the discussion, I believe at this point we are okay to make that trade-off. Let me take some time to look into the XFS COW paths and try to implement this approach. Thanks for the suggestion! Regards, ojaswin > > -Dave. > -- > Dave Chinner > david@fromorbit.com
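To make the COW-based approach discussed above concrete, the
writeback-side decision could be summarized by something like the
following (purely hypothetical types and helper, not XFS code);
awu_min/awu_max stand in for the filesystem/device atomic write unit
limits: if the range being written back maps to a COW extent that was
tagged as atomic at write() time and fits the limits, submit it as a
REQ_ATOMIC in-place overwrite, otherwise fall back to the normal COW
writeback (allocate new blocks, remap on completion).

enum atomic_wb_mode {
	ATOMIC_WB_REQ_ATOMIC,	/* in-place overwrite with REQ_ATOMIC */
	ATOMIC_WB_COW,		/* software fallback via COW remap */
};

struct atomic_cow_extent {
	bool		tagged_atomic;	/* created for RWF_ATOMIC at write time */
	unsigned int	len;		/* extent length in bytes */
	bool		aligned;	/* start/len aligned for HW atomics */
};

static enum atomic_wb_mode
atomic_writeback_mode(const struct atomic_cow_extent *ext,
		      unsigned int awu_min, unsigned int awu_max)
{
	if (!ext->tagged_atomic)
		return ATOMIC_WB_COW;
	if (!ext->aligned || ext->len < awu_min || ext->len > awu_max)
		return ATOMIC_WB_COW;
	return ATOMIC_WB_REQ_ATOMIC;
}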
On Fri, Nov 14, 2025 at 02:50:25PM +0530, Ojaswin Mujoo wrote:
> buffered IO. Further, many DBs support both direct IO and buffered IO
> well and it may not be fair to force them to stick to direct IO to get
> the benefits of atomic writes.

It may not be fair to force kernel developers to support a feature that
has no users.
On Thu, Nov 13, 2025 at 11:12:49AM +0530, Ritesh Harjani wrote:
> For e.g. https://lwn.net/Articles/974578/ goes in depth and talks about
> Postgres folks looking for this, since PostgreSQL databases uses
> buffered I/O for their database writes.

Honestly, a database stubbornly using the wrong I/O path should not be
a reason for adding this complexity.
syzbot ci has tested the following series
[v1] xfs: single block atomic writes for buffered IO
https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
* [RFC PATCH 1/8] fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
* [RFC PATCH 2/8] mm: Add PG_atomic
* [RFC PATCH 3/8] fs: Add initial buffered atomic write support info to statx
* [RFC PATCH 4/8] iomap: buffered atomic write support
* [RFC PATCH 5/8] iomap: pin pages for RWF_ATOMIC buffered write
* [RFC PATCH 6/8] xfs: Report atomic write min and max for buf io as well
* [RFC PATCH 7/8] iomap: Add bs<ps buffered atomic writes support
* [RFC PATCH 8/8] xfs: Lift the bs == ps restriction for HW buffered atomic writes
and found the following issue:
KASAN: slab-out-of-bounds Read in __bitmap_clear
Full report is available here:
https://ci.syzbot.org/series/430a088a-50e2-46d3-87ff-a1f0fa67b66c
***
KASAN: slab-out-of-bounds Read in __bitmap_clear
tree: linux-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base: ab40c92c74c6b0c611c89516794502b3a3173966
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/02d3e137-5d7e-4c95-8f32-43b8663d95df/config
C repro: https://ci.syzbot.org/findings/92a3582f-40a6-4936-8fcd-dc55c447a432/c_repro
syz repro: https://ci.syzbot.org/findings/92a3582f-40a6-4936-8fcd-dc55c447a432/syz_repro
==================================================================
BUG: KASAN: slab-out-of-bounds in __bitmap_clear+0x155/0x180 lib/bitmap.c:395
Read of size 8 at addr ffff88816ced7cd0 by task kworker/0:1/10
CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: xfs-conv/loop0 xfs_end_io
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
__bitmap_clear+0x155/0x180 lib/bitmap.c:395
bitmap_clear include/linux/bitmap.h:496 [inline]
ifs_clear_range_atomic fs/iomap/buffered-io.c:241 [inline]
iomap_clear_range_atomic+0x25c/0x630 fs/iomap/buffered-io.c:268
iomap_finish_folio_write+0x2f0/0x410 fs/iomap/buffered-io.c:1971
iomap_finish_ioend_buffered+0x223/0x5e0 fs/iomap/ioend.c:58
iomap_finish_ioends+0x116/0x2b0 fs/iomap/ioend.c:295
xfs_end_ioend+0x50b/0x690 fs/xfs/xfs_aops.c:168
xfs_end_io+0x253/0x2d0 fs/xfs/xfs_aops.c:205
process_one_work+0x94a/0x15d0 kernel/workqueue.c:3267
process_scheduled_works kernel/workqueue.c:3350 [inline]
worker_thread+0x9b0/0xee0 kernel/workqueue.c:3431
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x599/0xb30 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Allocated by task 5952:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
poison_kmalloc_redzone mm/kasan/common.c:397 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:414
kasan_kmalloc include/linux/kasan.h:262 [inline]
__do_kmalloc_node mm/slub.c:5672 [inline]
__kmalloc_noprof+0x41d/0x800 mm/slub.c:5684
kmalloc_noprof include/linux/slab.h:961 [inline]
kzalloc_noprof include/linux/slab.h:1094 [inline]
ifs_alloc+0x1e4/0x530 fs/iomap/buffered-io.c:356
iomap_writeback_folio+0x81c/0x26a0 fs/iomap/buffered-io.c:2084
iomap_writepages+0x162/0x2d0 fs/iomap/buffered-io.c:2168
xfs_vm_writepages+0x28a/0x300 fs/xfs/xfs_aops.c:701
do_writepages+0x32e/0x550 mm/page-writeback.c:2598
filemap_writeback mm/filemap.c:387 [inline]
filemap_fdatawrite_range mm/filemap.c:412 [inline]
file_write_and_wait_range+0x23e/0x340 mm/filemap.c:786
xfs_file_fsync+0x195/0x800 fs/xfs/xfs_file.c:137
generic_write_sync include/linux/fs.h:2639 [inline]
xfs_file_buffered_write+0x723/0x8a0 fs/xfs/xfs_file.c:1015
do_iter_readv_writev+0x623/0x8c0 fs/read_write.c:-1
vfs_writev+0x31a/0x960 fs/read_write.c:1057
do_pwritev fs/read_write.c:1153 [inline]
__do_sys_pwritev2 fs/read_write.c:1211 [inline]
__se_sys_pwritev2+0x179/0x290 fs/read_write.c:1202
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff88816ced7c80
which belongs to the cache kmalloc-96 of size 96
The buggy address is located 0 bytes to the right of
allocated 80-byte region [ffff88816ced7c80, ffff88816ced7cd0)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x16ced7
flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 057ff00000000000 ffff888100041280 dead000000000100 dead000000000122
raw: 0000000000000000 0000000080200020 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x252800(GFP_NOWAIT|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 1, tgid 1 (swapper/0), ts 12041529441, free_ts 0
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851
prep_new_page mm/page_alloc.c:1859 [inline]
get_page_from_freelist+0x2365/0x2440 mm/page_alloc.c:3920
__alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5209
alloc_slab_page mm/slub.c:3086 [inline]
allocate_slab+0x71/0x350 mm/slub.c:3257
new_slab mm/slub.c:3311 [inline]
___slab_alloc+0xf56/0x1990 mm/slub.c:4671
__slab_alloc+0x65/0x100 mm/slub.c:4794
__slab_alloc_node mm/slub.c:4870 [inline]
slab_alloc_node mm/slub.c:5266 [inline]
__kmalloc_cache_node_noprof+0x4b7/0x6f0 mm/slub.c:5799
kmalloc_node_noprof include/linux/slab.h:983 [inline]
alloc_node_nr_active kernel/workqueue.c:4908 [inline]
__alloc_workqueue+0x6a9/0x1b80 kernel/workqueue.c:5762
alloc_workqueue_noprof+0xd4/0x210 kernel/workqueue.c:5822
nbd_dev_add+0x4f1/0xae0 drivers/block/nbd.c:1961
nbd_init+0x168/0x1f0 drivers/block/nbd.c:2691
do_one_initcall+0x25a/0x860 init/main.c:1378
do_initcall_level+0x104/0x190 init/main.c:1440
do_initcalls+0x59/0xa0 init/main.c:1456
kernel_init_freeable+0x334/0x4b0 init/main.c:1688
kernel_init+0x1d/0x1d0 init/main.c:1578
page_owner free stack trace missing
Memory state around the buggy address:
ffff88816ced7b80: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
ffff88816ced7c00: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
>ffff88816ced7c80: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
^
ffff88816ced7d00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff88816ced7d80: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
==================================================================
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.