[RFC v2 00/11] Add dmabuf read/write via io_uring

Pavel Begunkov posted 11 patches 1 week, 1 day ago
[RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Pavel Begunkov 1 week, 1 day ago
Picking up the work on supporting dmabuf in the read/write path. There
are two main changes. First, it no longer passes a DMA address directly
but rather wraps it in an opaque structure, which is extended and
understood by the target driver.

The second big change is support for dynamic attachments, which added a
good part of the complexity (see Patch 5). I kept the main machinery in
nvme at first, but move_notify can ask to kill the dma mapping
asynchronously, and any new IO would need to wait during submission, so
it was moved to blk-mq. That also introduced an extra callback layer
between the driver and blk-mq.
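
In rough pseudo code, the blk-mq side of the move_notify contract looks
something like the sketch below. This is illustrative only, not the
actual Patch 5 interface; the field and helper names are made up:

/* illustrative sketch; see Patch 5 for the real interface */
static void blk_mq_dma_move_notify(struct dma_buf_attachment *attach)
{
	struct blk_mq_dma_token *token = attach->importer_priv;

	/* mark the mapping dead; new IO has to re-map and may wait */
	WRITE_ONCE(token->invalid, true);
	/*
	 * The fence is signaled once all inflight IO on the old
	 * mapping completes, after which it can be torn down.
	 */
	dma_resv_add_fence(attach->dmabuf->resv, &token->fence.base,
			   DMA_RESV_USAGE_KERNEL);
}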

There are some rough corners, and I'm not perfectly happy about the
complexity and layering. For v3 I'll try to move the waiting up in the
stack to io_uring wrapped into library helpers.

For now, I'm interested in two questions: what is the best way to test
move_notify, and how should dma_resv_reserve_fences() errors be handled
in move_notify?

The uapi didn't change: after registration the buffer looks like a
normal io_uring registered buffer and can be used as such. Only
non-vectored fixed reads/writes are allowed. Pseudo code:

// registration
reg_buf_idx = 0;
io_uring_update_buffer(ring, reg_buf_idx, { dma_buf_fd, file_fd });

// request creation
io_uring_prep_read_fixed(sqe, file_fd, buffer_offset,
                         buffer_size, file_offset, reg_buf_idx);
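
A slightly more complete sketch of the same flow. The registration
helper and struct here are provisional names (the real helper lives in
the dmabuf-rw liburing branch linked below and may change):

/* provisional API sketch; names and signatures may differ */
int dmabuf_fd = get_dmabuf_fd();  /* e.g. exported by a GPU driver */
int file_fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

/* registration: reg buffer 0 now maps the dmabuf for file_fd */
io_uring_update_buffer(ring, 0, &(struct io_uring_dmabuf_reg) {
	.dmabuf_fd = dmabuf_fd,
	.target_fd = file_fd,
});

/* IO: the "buffer address" is an offset into the dmabuf */
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
io_uring_prep_read_fixed(sqe, file_fd, (void *)buffer_offset,
			 buffer_size, file_offset, /*buf_index=*/0);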

As before, a good chunk of code was taken from Keith's series [1].

liburing based example:

git: https://github.com/isilence/liburing.git dmabuf-rw
link: https://github.com/isilence/liburing/tree/dmabuf-rw

[1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/

Pavel Begunkov (11):
  file: add callback for pre-mapping dmabuf
  iov_iter: introduce iter type for pre-registered dma
  block: move around bio flagging helpers
  block: introduce dma token backed bio type
  block: add infra to handle dmabuf tokens
  nvme-pci: add support for dmabuf registration
  nvme-pci: implement dma_token backed requests
  io_uring/rsrc: add imu flags
  io_uring/rsrc: extend reg buffer registration
  io_uring/rsrc: add dmabuf-backed buffer registration
  io_uring/rsrc: implement dmabuf regbuf import

 block/Makefile                   |   1 +
 block/bdev.c                     |  14 ++
 block/bio.c                      |  21 +++
 block/blk-merge.c                |  23 +++
 block/blk-mq-dma-token.c         | 236 +++++++++++++++++++++++++++++++
 block/blk-mq.c                   |  20 +++
 block/blk.h                      |   3 +-
 block/fops.c                     |   3 +
 drivers/nvme/host/pci.c          | 217 ++++++++++++++++++++++++++++
 include/linux/bio.h              |  49 ++++---
 include/linux/blk-mq-dma-token.h |  60 ++++++++
 include/linux/blk-mq.h           |  21 +++
 include/linux/blk_types.h        |   8 +-
 include/linux/blkdev.h           |   3 +
 include/linux/dma_token.h        |  35 +++++
 include/linux/fs.h               |   4 +
 include/linux/uio.h              |  10 ++
 include/uapi/linux/io_uring.h    |  13 +-
 io_uring/rsrc.c                  | 201 +++++++++++++++++++++++---
 io_uring/rsrc.h                  |  23 ++-
 io_uring/rw.c                    |   7 +-
 lib/iov_iter.c                   |  30 +++-
 22 files changed, 948 insertions(+), 54 deletions(-)
 create mode 100644 block/blk-mq-dma-token.c
 create mode 100644 include/linux/blk-mq-dma-token.h
 create mode 100644 include/linux/dma_token.h

-- 
2.52.0
Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Anuj Gupta 1 week ago
This series significantly reduces the IOMMU/DMA overhead for I/O,
particularly when the IOMMU is configured in STRICT or LAZY mode. I
modified t/io_uring in fio to exercise this path and tested with an
Intel Optane device. On my setup, I see the following improvement:

- STRICT: before = 570 KIOPS, after = 5.01 MIOPS
- LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
- PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS

The STRICT/LAZY numbers clearly show the benefit of avoiding per-I/O
dma_map/dma_unmap and reusing the pre-mapped DMA addresses.
--
Anuj Gupta
Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Pavel Begunkov 6 days, 12 hours ago
On 11/24/25 13:35, Anuj Gupta wrote:
> This series significantly reduces the IOMMU/DMA overhead for I/O,
> particularly when the IOMMU is configured in STRICT or LAZY mode. I
> modified t/io_uring in fio to exercise this path and tested with an
> Intel Optane device. On my setup, I see the following improvement:
> 
> - STRICT: before = 570 KIOPS, after = 5.01 MIOPS
> - LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
> - PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
> 
> The STRICT/LAZY numbers clearly show the benefit of avoiding per-I/O
> dma_map/dma_unmap and reusing the pre-mapped DMA addresses.

Thanks for giving it a run. It looks promising indeed, and I believe
that was the main use case Keith was pursuing. I'll fix up the build
problems for v3.

-- 
Pavel Begunkov
Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Christian König 1 week ago
On 11/23/25 23:51, Pavel Begunkov wrote:
> Picking up the work on supporting dmabuf in the read/write path.

IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.

Or am I mixing something up here? Since I don't see any dma_fence implementation at all that might actually be the case.

On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.

Regards,
Christian.

Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Pavel Begunkov 1 week ago
On 11/24/25 10:33, Christian König wrote:
> On 11/23/25 23:51, Pavel Begunkov wrote:
>> Picking up the work on supporting dmabuf in the read/write path.
> 
> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
> 
> Or am I mixing something up here?

The time gap is purely due to me being busy. I wasn't CC'ed to those private
discussions you mentioned, but the v1 feedback was to use dynamic attachments
and avoid passing dma address arrays directly.

https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/

I'm lost on what part is not doable. Can you elaborate on the core
dma-fence dma-buf rules?

> Since I don't see any dma_fence implementation at all that might actually be the case.

See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
callback and is signaled when all inflight IO using the current
mapping are complete. All new IO requests will try to recreate the
mapping, and hence potentially wait with dma_resv_wait_timeout().
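
In other words, the fence is tied to an inflight-IO count on the
mapping, roughly along these lines (a sketch, not the patch's exact
code; field names are illustrative):

/* sketch: signal the fence when the last IO on the mapping ends */
static void blk_mq_dma_token_put(struct blk_mq_dma_token *token)
{
	/* move_notify() set ->move_fence and waits for it */
	if (atomic_dec_and_test(&token->inflight) && token->move_fence)
		dma_fence_signal(token->move_fence);
}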

> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.

Have any reference?

-- 
Pavel Begunkov

Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Christian König 1 week ago
On 11/24/25 12:30, Pavel Begunkov wrote:
> On 11/24/25 10:33, Christian König wrote:
>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>> Picking up the work on supporting dmabuf in the read/write path.
>>
>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>
>> Or am I mixing something up here?
> 
> The time gap is purely due to me being busy. I wasn't CC'ed to those private
> discussions you mentioned, but the v1 feedback was to use dynamic attachments
> and avoid passing dma address arrays directly.
> 
> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
> 
> I'm lost on what part is not doable. Can you elaborate on the core
> dma-fence dma-buf rules?

I most likely mixed that up, in other words that was a different discussion.

When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...

For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.

>> Since I don't see any dma_fence implementation at all that might actually be the case.
> 
> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
> callback and is signaled when all inflight IO using the current
> mapping are complete. All new IO requests will try to recreate the
> mapping, and hence potentially wait with dma_resv_wait_timeout().

Without looking at the code that approach sounds more or less correct to me.

>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
> 
> Have any reference?

There is a WIP feature in AMDs GPU driver package for ROCm.

But that can't be used as general purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.

BTW when you use DMA addresses from DMA-buf always keep in mind that this memory can be written by others at the same time, e.g. you can't do things like compute a CRC first, then write to backing store and finally compare CRC.

Regards,
Christian.
Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Pavel Begunkov 6 days, 11 hours ago
On 11/24/25 14:17, Christian König wrote:
> On 11/24/25 12:30, Pavel Begunkov wrote:
>> On 11/24/25 10:33, Christian König wrote:
>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>
>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>
>>> Or am I mixing something up here?
>>
>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>> and avoid passing dma address arrays directly.
>>
>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>
>> I'm lost on what part is not doable. Can you elaborate on the core
>> dma-fence dma-buf rules?
> 
> I most likely mixed that up, in other words that was a different discussion.
> 
> When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...

I'm curious, what can happen if there is new IO using a
move_notify()ed mapping, but let's say it's guaranteed to complete
strictly before dma_buf_unmap_attachment() and the fence is signaled?
Is there some loss of data or corruption that can happen?

sg_table = map_attach()         |
move_notify()                   |
  -> add_fence(fence)           |
                                | issue_IO(sg_table)
                                | // IO completed
unmap_attachment(sg_table)      |
signal_fence(fence)             |

> For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.

Looks like I have a terminology gap here. By "memory allocations" you
don't mean kmalloc, right? I assume it's about new users of the
mapping.

>>> Since I don't see any dma_fence implementation at all that might actually be the case.
>>
>> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
>> callback and is signaled when all inflight IO using the current
>> mapping are complete. All new IO requests will try to recreate the
>> mapping, and hence potentially wait with dma_resv_wait_timeout().
> 
> Without looking at the code that approach sounds more or less correct to me.
> 
>>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>>
>> Have any reference?
> 
> There is a WIP feature in AMDs GPU driver package for ROCm.
> 
> But that can't be used as general purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.

Got it

> BTW when you use DMA addresses from DMA-buf always keep in mind that this memory can be written by others at the same time, e.g. you can't do things like compute a CRC first, then write to backing store and finally compare CRC.

Right. The direct IO path also works with user pages, so the
constraints are similar in this regard.

-- 
Pavel Begunkov

Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Christian König 6 days, 10 hours ago
On 11/25/25 14:52, Pavel Begunkov wrote:
> On 11/24/25 14:17, Christian König wrote:
>> On 11/24/25 12:30, Pavel Begunkov wrote:
>>> On 11/24/25 10:33, Christian König wrote:
>>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>>
>>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>>
>>>> Or am I mixing something up here?
>>>
>>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>>> and avoid passing dma address arrays directly.
>>>
>>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>>
>>> I'm lost on what part is not doable. Can you elaborate on the core
>>> dma-fence dma-buf rules?
>>
>> I most likely mixed that up, in other words that was a different discussion.
>>
>> When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...
> 
> I'm curious, what can happen if there is new IO using a
> move_notify()ed mapping, but let's say it's guaranteed to complete
> strictly before dma_buf_unmap_attachment() and the fence is signaled?
> Is there some loss of data or corruption that can happen?

The problem is that you can't guarantee that because you run into deadlocks.

As soon as a dma_fence is created and published by calling add_fence, memory management can loop back and depend on that fence.

So you actually can't issue any new IO which might block the unmap operation.

> 
> sg_table = map_attach()         |
> move_notify()                   |
>   -> add_fence(fence)           |
>                                 | issue_IO(sg_table)
>                                 | // IO completed
> unmap_attachment(sg_table)      |
> signal_fence(fence)             |
> 
>> For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.
> 
>> Looks like I have a terminology gap here. By "memory allocations" you
> don't mean kmalloc, right? I assume it's about new users of the
> mapping.

kmalloc() as well as get_free_page() is exactly what is meant here.

You can't make any memory allocation any more after creating/publishing a dma_fence.

The usual flow is the following:

1. Lock the dma_resv object
2. Prepare the I/O operation, make all memory allocations etc...
3. Allocate the dma_fence object
4. Push the I/O operation to the HW, making sure that you don't allocate memory any more.
5. Call dma_resv_add_fence() (with the fence allocated in #3).
6. Unlock the dma_resv object

If you stray from that you most likely end up in a deadlock sooner or later.
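
Schematically, those six steps map onto the dma_resv API like this (a
sketch only; the driver_* helpers and the job struct are hypothetical):

dma_resv_lock(resv, NULL);                  /* 1. lock dma_resv */

ret = driver_prepare_io(job);               /* 2. all allocations,  */
if (!ret)                                   /*    incl. fence slots */
	ret = dma_resv_reserve_fences(resv, 1);
if (ret)
	goto unlock;

fence = driver_alloc_fence(job);            /* 3. allocate the fence */
driver_push_to_hw(job);                     /* 4. no allocations from
                                             *    here on */
dma_resv_add_fence(resv, fence,             /* 5. publish the fence */
		   DMA_RESV_USAGE_KERNEL);
unlock:
dma_resv_unlock(resv);                      /* 6. unlock */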

Regards,
Christian.

Re: [RFC v2 00/11] Add dmabuf read/write via io_uring
Posted by Pavel Begunkov 6 days, 5 hours ago
On 11/25/25 14:21, Christian König wrote:
> On 11/25/25 14:52, Pavel Begunkov wrote:
>> On 11/24/25 14:17, Christian König wrote:
>>> On 11/24/25 12:30, Pavel Begunkov wrote:
>>>> On 11/24/25 10:33, Christian König wrote:
>>>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>>>
>>>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>>>
>>>>> Or am I mixing something up here?
>>>>
>>>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>>>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>>>> and avoid passing dma address arrays directly.
>>>>
>>>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>>>
>>>> I'm lost on what part is not doable. Can you elaborate on the core
>>>> dma-fence dma-buf rules?
>>>
>>> I most likely mixed that up, in other words that was a different discussion.
>>>
>>> When you use dma_fences to indicate async completion of events you need to be super duper careful that you only do this for in flight events, have the fence creation in the right order etc...
>>
>> I'm curious, what can happen if there is new IO using a
>> move_notify()ed mapping, but let's say it's guaranteed to complete
>> strictly before dma_buf_unmap_attachment() and the fence is signaled?
>> Is there some loss of data or corruption that can happen?
> 
> The problem is that you can't guarantee that because you run into deadlocks.
> 
> As soon as a dma_fence is created and published by calling add_fence, memory management can loop back and depend on that fence.

I think I got the idea, thanks

> So you actually can't issue any new IO which might block the unmap operation.
> 
>>
>> sg_table = map_attach()         |
>> move_notify()                   |
>>    -> add_fence(fence)           |
>>                                  | issue_IO(sg_table)
>>                                  | // IO completed
>> unmap_attachment(sg_table)      |
>> signal_fence(fence)             |
>>
>>> For example once the fence is created you can't make any memory allocations any more, that's why we have this dance of reserving fence slots, creating the fence and then adding it.
>>
>> Looks like I have a terminology gap here. By "memory allocations" you
>> don't mean kmalloc, right? I assume it's about new users of the
>> mapping.
> 
> kmalloc() as well as get_free_page() is exactly what is meant here.
> 
> You can't make any memory allocation any more after creating/publishing a dma_fence.

I see, thanks

> The usually flow is the following:
> 
> 1. Lock dma_resv object
> 2. Prepare I/O operation, make all memory allocations etc...
> 3. Allocate dma_fence object
> 4. Push I/O operation to the HW, making sure that you don't allocate memory any more.
> 5. Call dma_resv_add_fence(with fence allocate in #3).
> 6. Unlock dma_resv object
> 
> If you stride from that you most likely end up in a deadlock sooner or later.
-- 
Pavel Begunkov