[PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 2 weeks ago
Main steps to load file data into dmabuf (sketched in C below):
1. dmabuf_fd = dmabuf_alloc(len, heap_fd)
2. vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0)
3. file_fd = open(file_path, O_RDONLY)
4. read(file_fd, vaddr, len)
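
A minimal C sketch of these four steps, assuming the system dma-heap at
/dev/dma_heap/system (the same heap the test program at the end of this
thread uses); error handling is trimmed:

#include <linux/dma-heap.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Steps 1-4: allocate a dmabuf from the system heap, mmap it, then
 * read() the file into the mapping. Only buffered I/O works here,
 * because the VM_PFNMAP mapping rejects O_DIRECT. */
static int load_file_into_dmabuf(const char *path, size_t len)
{
    struct dma_heap_allocation_data data = {
        .len = len,
        .fd_flags = O_RDWR | O_CLOEXEC,
    };
    int heap_fd, file_fd;
    void *vaddr;

    heap_fd = open("/dev/dma_heap/system", O_RDONLY);
    if (heap_fd < 0 || ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0)
        return -1;                              /* 1. dmabuf_fd = data.fd */

    vaddr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_SHARED, data.fd, 0);       /* 2. map the dmabuf */
    file_fd = open(path, O_RDONLY);             /* 3. open the file */
    if (vaddr != MAP_FAILED && file_fd >= 0)
        read(file_fd, vaddr, len);              /* 4. buffered read */

    if (vaddr != MAP_FAILED)
        munmap(vaddr, len);
    if (file_fd >= 0)
        close(file_fd);
    close(heap_fd);
    return data.fd;                             /* hand out the dmabuf fd */
}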

dmabuf's attachment/map/fence model sets VM_PFNMAP on its mmap, and
VM_PFNMAP mappings lack direct I/O support[1]. Buffered I/O therefore adds
noticeable latency when loading large AI model files.

My previous patch added a dmabuf ioctl for direct I/O file operations,
which showed good performance on low-power CPUs[2][3]. Christian suggested
using existing uAPIs (read/sendfile/splice/copy_file_range, abbreviated
c_f_r below) instead of new ioctls.

sendfile/splice/c_f_r enable zero-copy via direct I/O for disk-to-disk/network:
sendfile(skt_fd, disk_fd):    [DISK] -DMA-> [pipe(buf)] -DMA-> [NIC]
sendfile(dst_disk, src_disk): [DISK] -DMA-> [pipe(buf)] -DMA-> [DISK]

Analysis shows existing uAPIs can't achieve zero-copy disk-to-dmabuf.
Since dmabuf lacks file ops, using tmpfs for disk-to-tmpfs CPU analysis:
| Method            | CPU Copies | Key Overhead               |
|-------------------|------------|----------------------------|
| 1. Buffer R+W     | 2          | Alloc(cache) & 2 CPU copies|
| 2. Direct R+W     | 1          | GUP(usr_buf) & 1 CPU copy  |
| 3. Mmap+Buffer R  | 1          | Alloc(cache) & 1 CPU copy  |
| 4. Mmap+Direct R  | 0          | GUP(mem_buf) ~50% CPU      |
| 5. Buffer Sendfile| 1          | Alloc(cache) & 1 CPU copy  |
| 6. Direct Sendfile| 1          | Small pipe, high IRQ       |
| 7. Buffer Splice  | 1          | Alloc(cache) & 1 CPU copy  |
| 8. Direct Splice  | 1          | Larger pipe buffer         |
| 9. c_f_r          | N/A        | Cross-FS blocked           |

GUP: get_user_pages
Alloc(cache): allocate page cache

Data flows:
1. [DISK] -DMA-> [Alloc(cache)] -COPY-> [usr_buf] -COPY-> [MEM]
2. [DISK] -DMA-> [GUP(usr_buf)] -COPY-> [MEM]
3. [DISK] -DMA-> [Alloc(cache)] -COPY-> [mem_buf]
4. [DISK] -DMA-> [GUP(mem_buf)]
5. [DISK] -DMA-> [pipe(Alloc(cache))] -COPY-> [tmpfs page]
6. [DISK] -DMA-> [pipe(buf)] -COPY-> [tmpfs page]
7. [DISK] -DMA-> [big_pipe(Alloc(cache))] -COPY-> [tmpfs page]
8. [DISK] -DMA-> [big_pipe(buf)] -COPY-> [tmpfs page]
9. [DISK] -DMA-> [tmpfs page] (blocked)

Key findings:
- Buffered I/O requires page cache allocation and at least one CPU copy.
- Read+Write incurs excessive CPU copies and is not analyzed further; later
  approaches use Read instead of mmap+Read.
- Mmap+Direct has zero copies but ~50% GUP CPU overhead, and dmabuf's
  VM_PFNMAP mapping does not support direct I/O at all (see the sketch
  after this list).
- sendfile/splice require an intermediate pipe, costing one CPU copy.
- c_f_r limitations: cross-FS copies are blocked and memory FS support is
  missing.
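
As referenced above, method 4 ("Mmap+Direct R") boils down to code like
this minimal sketch (memfd assumed; build with -D_GNU_SOURCE, needs
<sys/mman.h>, <fcntl.h>, <unistd.h>); the GUP cost comes from the block
layer pinning the mapped destination pages for the O_DIRECT read:

/* Method 4: O_DIRECT read straight into a mapped memfd. The direct I/O
 * path pins the mapped destination pages (pin_user_pages), which is the
 * GUP overhead cited above. 'path' and 'len' are placeholder inputs. */
static ssize_t direct_read_into_memfd(const char *path, size_t len)
{
    int mfd = memfd_create("mem_buf", MFD_CLOEXEC);
    int fd = open(path, O_RDONLY | O_DIRECT);
    ssize_t n = -1;
    void *dst;

    ftruncate(mfd, len);
    dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
    if (dst != MAP_FAILED && fd >= 0)
        n = read(fd, dst, len);     /* [DISK] -DMA-> [GUP(mem_buf)] */

    /* ... use dst/mfd, then munmap() and close() ... */
    return n;
}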

Modifications:
1. Enable cross-FS c_f_r for memory file types
2. Add dmabuf c_f_r callbacks for [DISK]-DMA->[dmabuf] (usage sketch below)
3. Test tmpfs c_f_r locally only (not submitted upstream) due to lock_page
   deadlock risks
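
With modification 2 in place, loading the file becomes a plain
copy_file_range() loop from an O_DIRECT file fd into the dmabuf fd.
A minimal sketch, mirroring the membuf_cfr() helper in the test program
attached at the end of this thread (needs _GNU_SOURCE and <unistd.h>):

/* Copy 'len' bytes from a file opened with O_RDONLY | O_DIRECT into a
 * dmabuf fd (from DMA_HEAP_IOCTL_ALLOC or udmabuf). With the c_f_r
 * callback added by this series, the disk DMA lands directly in the
 * dmabuf pages with no CPU copy. */
static ssize_t copy_file_to_dmabuf(int file_fd, int dmabuf_fd, size_t len)
{
    loff_t in_off = 0, out_off = 0;
    size_t done = 0;

    while (done < len) {
        ssize_t ret = copy_file_range(file_fd, &in_off,
                                      dmabuf_fd, &out_off,
                                      len - done, 0);
        if (ret <= 0)
            return ret;     /* error or unexpected short copy */
        done += ret;
    }
    return done;
}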

Performance (1GHz CPU, UFS4@4GB):
1. tmpfs(memfd) direct c_f_r (1197 MB/s): ~18% faster than mmap & direct
   read (1014 MB/s)
2. udmabuf+memfd direct c_f_r (2318 MB/s): ~59% faster than mmap & direct
   read (1457 MB/s)
3. dmabuf direct c_f_r (3405 MB/s): ~3.7x dmabuf buffered I/O (918 MB/s),
   ~47% faster than udmabuf direct c_f_r (2318 MB/s)

|    32x32MB Read 1024MB  |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|-------------------------|--------|--------|--------|--------|-----
| 1)Beg  dmabuf buffer R/W|     52 |      5 |   1170 |    918 | 100%
| 2)    udmabuf buffer R/W|    591 |    326 |   1281 |    838 |  91%
| 3)      memfd buffer R/W|      1 |    323 |   2370 |    453 |  49%
| 4)      memfd direct R/W|      1 |    321 |   1058 |   1014 | 110%
| 5) memfd buffer sendfile|      1 |    329 |   1577 |    681 |  74%
| 6) memfd direct sendfile|      1 |    327 |   2672 |    401 |  43%
| 7)   memfd buffer splice|      2 |    321 |   1729 |    621 |  67%
| 8)   memfd direct splice|      2 |    324 |   1528 |    702 |  76%
| 9)    memfd buffer c_f_r|      1 |    325 |   1586 |    677 |  73%
|10)    memfd direct c_f_r|      1 |    323 |    897 |   1197 | 130%
|11)      u+mfd buffer R/W|    609 |    344 |   2207 |    486 |  52%
|12)      u+mfd direct R/W|    580 |    342 |    737 |   1457 | 158%
|13) u+mfd buffer sendfile|    582 |    343 |   1270 |    845 |  92%
|14) u+mfd direct sendfile|    573 |    344 |   2254 |    476 |  51%
|15)   u+mfd buffer splice|    584 |    341 |   1202 |    893 |  97%
|16)   u+mfd direct splice|    564 |    340 |    851 |   1263 | 137%
|17)    u+mfd buffer c_f_r|    585 |    344 |   1244 |    863 |  94%
|18)    u+mfd direct c_f_r|    578 |    341 |    581 |   1848 | 201%
|19)  udmabuf buffer c_f_r|    585 |    328 |   1163 |    923 | 100%
|20)  udmabuf direct c_f_r|    579 |    328 |    464 |   2318 | 252%
|21)   dmabuf buffer c_f_r|     48 |      5 |   1058 |   1015 | 110%
|22)   dmabuf direct c_f_r|     48 |      5 |    316 |   3405 | 370%
|23)End  dmabuf buffer R/W|     48 |      5 |   1173 |    915 |  99%

u+mfd = udma+memfd: a udmabuf created from a pre-allocated memfd.
The page cache is dropped during the tests to simulate real-world
large-file loading.

dmabuf file use cases:
- Loading large AI models using dmabuf
- Real-time data capture and storage with dmabuf
- Persisting task snapshots in Android

v3 -> v4:
 Add memory_copy_file_fops to simplify code and add FMODE_ODIRECT check
 Explicitly add dependency headers for udmabuf
 Simplify rw_file implementation in udmabuf/system_heaps
 Set FMODE_ODIRECT for dmabuf supporting Direct I/O
v2 -> v3: [4]
 copy_file_range supports copying from disk files to memory files.
 Implement the copy_file_range callback functions for dmabuf/udmabuf.
v1 -> v2: [3]
 Dma-buf exporters verify exclusive access to the dmabuf's sg_table.
v1: [2]

Reference:
[1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com
[2] https://lore.kernel.org/all/20250513092803.2096-1-tao.wangtao@honor.com
[3] https://lore.kernel.org/all/20250516092148.12778-1-tao.wangtao@honor.com
[4] https://lore.kernel.org/all/20250530103941.11092-1-tao.wangtao@honor.com

wangtao (4):
  fs: allow cross-FS copy_file_range for memory file with direct I/O
  dmabuf: Implement copy_file_range callback for dmabuf direct I/O prep
  udmabuf: Implement udmabuf direct I/O
  dmabuf:system_heap Implement system_heap dmabuf direct I/O

 drivers/dma-buf/dma-buf.c           | 32 +++++++++++++
 drivers/dma-buf/heaps/system_heap.c | 69 +++++++++++++++++++++++++++++
 drivers/dma-buf/udmabuf.c           | 54 ++++++++++++++++++++++
 fs/read_write.c                     | 64 +++++++++++++++++++++-----
 include/linux/dma-buf.h             | 16 +++++++
 include/linux/fs.h                  |  2 +
 6 files changed, 225 insertions(+), 12 deletions(-)

-- 
2.17.1
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 2 weeks ago
This is a really weird interface.  No one has yet to explain why dmabuf
is so special that we can't support direct I/O to it when we can support
it to otherwise exotic mappings like PCI P2P ones.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christian König 6 months, 2 weeks ago
On 6/3/25 15:00, Christoph Hellwig wrote:
> This is a really weird interface.  No one has yet to explain why dmabuf
> is so special that we can't support direct I/O to it when we can support
> it to otherwise exotic mappings like PCI P2P ones.

With udmabuf you can do direct I/O, it's just inefficient to walk the page tables for it when you already have an array of all the folios.

Regards,
Christian.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 2 weeks ago
On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
> On 6/3/25 15:00, Christoph Hellwig wrote:
> > This is a really weird interface.  No one has yet to explain why dmabuf
> > is so special that we can't support direct I/O to it when we can support
> > it to otherwise exotic mappings like PCI P2P ones.
> 
> With udmabuf you can do direct I/O, it's just inefficient to walk the
> page tables for it when you already have an array of all the folios.

Does it matter compared to the I/O in this case?

Either way there has been talk (in case of networking implementations)
that use a dmabuf as a first class container for lower level I/O.
I'd much rather do that than adding odd side interfaces.  I.e. have
a version of splice that doesn't bother with the pipe, but instead
just uses in-kernel direct I/O on one side and dmabuf-provided folios
on the other.
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 2 weeks ago

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, June 3, 2025 9:20 PM
> To: Christian König <christian.koenig@amd.com>
> Cc: Christoph Hellwig <hch@infradead.org>; wangtao
> <tao.wangtao@honor.com>; sumit.semwal@linaro.org; kraxel@redhat.com;
> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
> yipengxiang <yipengxiang@honor.com>; liulu 00013167
> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
> 
> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
> > On 6/3/25 15:00, Christoph Hellwig wrote:
> > > This is a really weird interface.  No one has yet to explain why
> > > dmabuf is so special that we can't support direct I/O to it when we
> > > can support it to otherwise exotic mappings like PCI P2P ones.
> >
> > With udmabuf you can do direct I/O, it's just inefficient to walk the
> > page tables for it when you already have an array of all the folios.
> 
> Does it matter compared to the I/O in this case?
> 
> Either way there has been talk (in case of networking implementations) that
> use a dmabuf as a first class container for lower level I/O.
> I'd much rather do that than adding odd side interfaces.  I.e. have a version
> of splice that doesn't bother with the pipe, but instead just uses in-kernel
> direct I/O on one side and dmabuf-provided folios on the other.
If the VFS layer recognizes dmabuf type and acquires its sg_table
and folios, zero-copy could also be achieved. I initially thought
dmabuf acts as a driver and shouldn't be handled by VFS, so I made
dmabuf implement copy_file_range callbacks to support direct I/O
zero-copy. I'm open to both approaches. What's the preference of
VFS experts?

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christian König 6 months, 2 weeks ago
On 6/6/25 11:52, wangtao wrote:
> 
> 
>> -----Original Message-----
>> From: Christoph Hellwig <hch@infradead.org>
>> Sent: Tuesday, June 3, 2025 9:20 PM
>> To: Christian König <christian.koenig@amd.com>
>> Cc: Christoph Hellwig <hch@infradead.org>; wangtao
>> <tao.wangtao@honor.com>; sumit.semwal@linaro.org; kraxel@redhat.com;
>> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
>> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
>> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
>> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
>> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
>> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
>> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
>> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
>> yipengxiang <yipengxiang@honor.com>; liulu 00013167
>> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
>> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
>> copy_file_range
>>
>> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
>>> On 6/3/25 15:00, Christoph Hellwig wrote:
>>>> This is a really weird interface.  No one has yet to explain why
>>>> dmabuf is so special that we can't support direct I/O to it when we
>>>> can support it to otherwise exotic mappings like PCI P2P ones.
>>>
>>> With udmabuf you can do direct I/O, it's just inefficient to walk the
>>> page tables for it when you already have an array of all the folios.
>>
>> Does it matter compared to the I/O in this case?
>>
>> Either way there has been talk (in case of networking implementations) that
>> use a dmabuf as a first class container for lower level I/O.
>> I'd much rather do that than adding odd side interfaces.  I.e. have a version
>> of splice that doesn't bother with the pipe, but instead just uses in-kernel
>> direct I/O on one side and dmabuf-provided folios on the other.
> If the VFS layer recognizes dmabuf type and acquires its sg_table
> and folios, zero-copy could also be achieved. I initially thought
> dmabuf acts as a driver and shouldn't be handled by VFS, so I made
> dmabuf implement copy_file_range callbacks to support direct I/O
> zero-copy. I'm open to both approaches. What's the preference of
> VFS experts?

That would probably be illegal. Using the sg_table in the DMA-buf implementation turned out to be a mistake.

The question Christoph raised was rather why is your CPU so slow that walking the page tables has a significant overhead compared to the actual I/O?

Regards,
Christian.

> 
> Regards,
> Wangtao.
> 

Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 1 week ago
On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
> > dmabuf acts as a driver and shouldn't be handled by VFS, so I made
> > dmabuf implement copy_file_range callbacks to support direct I/O
> > zero-copy. I'm open to both approaches. What's the preference of
> > VFS experts?
> 
> That would probably be illegal. Using the sg_table in the DMA-buf
> implementation turned out to be a mistake.

Two things here that should not be directly conflated.  Using the
sg_table was a huge mistake, and we should try to move dmabuf to
switch that to a pure dma_addr_t/len array now that the new DMA API
supporting that has been merged.  Is there any chance the dma-buf
maintainers could start to kick this off?  I'm of course happy to
assist.

But that notwithstanding, dma-buf is THE buffer sharing mechanism in
the kernel, and we should promote it instead of reinventing it badly.
And there is a use case for having a fully DMA mapped buffer in the
block layer and I/O path, especially on systems with an IOMMU.
So having an iov_iter backed by a dma-buf would be extremely helpful.
That's mostly lib/iov_iter.c code, not VFS, though.

> The question Christoph raised was rather why is your CPU so slow
> that walking the page tables has a significant overhead compared to
> the actual I/O?

Yes, that's really puzzling and should be addressed first.
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 1 week ago

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Monday, June 9, 2025 12:35 PM
> To: Christian König <christian.koenig@amd.com>
> Cc: wangtao <tao.wangtao@honor.com>; Christoph Hellwig
> <hch@infradead.org>; sumit.semwal@linaro.org; kraxel@redhat.com;
> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
> yipengxiang <yipengxiang@honor.com>; liulu 00013167
> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
> 
> On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
> > > dmabuf acts as a driver and shouldn't be handled by VFS, so I made
> > > dmabuf implement copy_file_range callbacks to support direct I/O
> > > zero-copy. I'm open to both approaches. What's the preference of VFS
> > > experts?
> >
> > That would probably be illegal. Using the sg_table in the DMA-buf
> > implementation turned out to be a mistake.
> 
> Two things here that should not be directly conflated.  Using the sg_table was
> a huge mistake, and we should try to move dmabuf to switch that to a pure
I'm a bit confused: don't dmabuf importers need to traverse sg_table to
access folios or dma_addr/len? Do you mean restricting sg_table access
(e.g., only via iov_iter) or proposing alternative approaches?

> dma_addr_t/len array now that the new DMA API supporting that has been
> merged.  Is there any chance the dma-buf maintainers could start to kick this
> off?  I'm of course happy to assist.
> 
> But that notwithstanding, dma-buf is THE buffer sharing mechanism in the
> kernel, and we should promote it instead of reinventing it badly.
> And there is a use case for having a fully DMA mapped buffer in the block
> layer and I/O path, especially on systems with an IOMMU.
> So having an iov_iter backed by a dma-buf would be extremely helpful.
> That's mostly lib/iov_iter.c code, not VFS, though.
Are you suggesting adding an ITER_DMABUF type to iov_iter, or
implementing dmabuf-to-iov_bvec conversion within iov_iter?

> 
> > The question Christoph raised was rather why is your CPU so slow that
> > walking the page tables has a significant overhead compared to the
> > actual I/O?
> 
> Yes, that's really puzzling and should be addressed first.
With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
is relatively low (observed in 3GHz tests).
|    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|---------------------------|--------|--------|--------|--------|-----
| 1)        memfd direct R/W|      1 |    118 |    312 |   3448 | 100%
| 2)      u+memfd direct R/W|    196 |    123 |    295 |   3651 | 105%
| 3) u+memfd direct sendfile|    175 |    102 |    976 |   1100 |  31%
| 4)   u+memfd direct splice|    173 |    103 |    443 |   2428 |  70%
| 5)      udmabuf buffer R/W|    183 |    100 |    453 |   2375 |  68%
| 6)       dmabuf buffer R/W|     34 |      4 |    427 |   2519 |  73%
| 7)    udmabuf direct c_f_r|    200 |    102 |    278 |   3874 | 112%
| 8)     dmabuf direct c_f_r|     36 |      5 |    269 |   4002 | 116%

With lower CPU performance (e.g., 1GHz), GUP overhead becomes more
significant (as seen in 1GHz tests).
|    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|---------------------------|--------|--------|--------|--------|-----
| 1)        memfd direct R/W|      2 |    393 |    969 |   1109 | 100%
| 2)      u+memfd direct R/W|    592 |    424 |    570 |   1884 | 169%
| 3) u+memfd direct sendfile|    587 |    356 |   2229 |    481 |  43%
| 4)   u+memfd direct splice|    568 |    352 |    795 |   1350 | 121%
| 5)      udmabuf buffer R/W|    597 |    343 |   1238 |    867 |  78%
| 6)       dmabuf buffer R/W|     69 |     13 |   1128 |    952 |  85%
| 7)    udmabuf direct c_f_r|    595 |    345 |    372 |   2889 | 260%
| 8)     dmabuf direct c_f_r|     80 |     13 |    274 |   3929 | 354%

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 1 week ago
On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> Are you suggesting adding an ITER_DMABUF type to iov_iter,

Yes.
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 1 week ago
> 
> On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > Are you suggesting adding an ITER_DMABUF type to iov_iter,
> 
> Yes.

May I clarify: Do all disk operations require data to pass through
memory (reading into memory or writing from memory)? In the block layer,
the bio structure uses bio_iov_iter_get_pages to convert iter_type
objects into memory-backed bio_vec representations.
However, some dmabufs are not memory-based, making page-to-bio_vec
conversion impossible. This suggests adding a callback function in
dma_buf_ops to handle dmabuf-to-bio_vec conversion.

Interestingly, if such a callback exists, the need for a dedicated
ITER_DMABUF type might disappear. Would you like to discuss potential
implementation tradeoffs here?

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months ago
On Fri, Jun 13, 2025 at 09:33:55AM +0000, wangtao wrote:
> 
> > 
> > On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > > Are you suggesting adding an ITER_DMABUF type to iov_iter,
> > 
> > Yes.
> 
> May I clarify: Do all disk operations require data to pass through
> memory (reading into memory or writing from memory)? In the block layer,
> the bio structure uses bio_iov_iter_get_pages to convert iter_type
> objects into memory-backed bio_vec representations.
> However, some dmabufs are not memory-based, making page-to-bio_vec
> conversion impossible. This suggests adding a callback function in
> dma_buf_ops to handle dmabuf-to-bio_vec conversion.

bios do support PCI P2P transfers.  This could be fairly easily extended
to other peer to peer transfers if we manage to come up with a coherent
model for them.  No need for a callback.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christian König 6 months, 1 week ago
On 6/9/25 11:32, wangtao wrote:
>> -----Original Message-----
>> From: Christoph Hellwig <hch@infradead.org>
>> Sent: Monday, June 9, 2025 12:35 PM
>> To: Christian König <christian.koenig@amd.com>
>> Cc: wangtao <tao.wangtao@honor.com>; Christoph Hellwig
>> <hch@infradead.org>; sumit.semwal@linaro.org; kraxel@redhat.com;
>> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
>> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
>> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
>> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
>> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
>> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
>> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
>> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
>> yipengxiang <yipengxiang@honor.com>; liulu 00013167
>> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
>> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
>> copy_file_range
>>
>> On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
>>>> dmabuf acts as a driver and shouldn't be handled by VFS, so I made
>>>> dmabuf implement copy_file_range callbacks to support direct I/O
>>>> zero-copy. I'm open to both approaches. What's the preference of VFS
>>>> experts?
>>>
>>> That would probably be illegal. Using the sg_table in the DMA-buf
>>> implementation turned out to be a mistake.
>>
>> Two things here that should not be directly conflated.  Using the sg_table was
>> a huge mistake, and we should try to move dmabuf to switch that to a pure
> I'm a bit confused: don't dmabuf importers need to traverse sg_table to
> access folios or dma_addr/len? Do you mean restricting sg_table access
> (e.g., only via iov_iter) or proposing alternative approaches?

No, accessing pages/folios inside the sg_table of a DMA-buf is strictly forbidden.

We have removed most use cases of that over the years and push back on generating new ones. 

> 
>> dma_addr_t/len array now that the new DMA API supporting that has been
>> merged.  Is there any chance the dma-buf maintainers could start to kick this
>> off?  I'm of course happy to assist.

Work on that is already underway for some time.

Most GPU drivers already do sg_table -> DMA array conversion, I need to push on the remaining to clean up.

But there are also tons of other users of dma_buf_map_attachment() which need to be converted.

>> But that notwithstanding, dma-buf is THE buffer sharing mechanism in the
>> kernel, and we should promote it instead of reinventing it badly.
>> And there is a use case for having a fully DMA mapped buffer in the block
>> layer and I/O path, especially on systems with an IOMMU.
>> So having an iov_iter backed by a dma-buf would be extremely helpful.
>> That's mostly lib/iov_iter.c code, not VFS, though.
> Are you suggesting adding an ITER_DMABUF type to iov_iter, or
> implementing dmabuf-to-iov_bvec conversion within iov_iter?

That would be rather nice to have, yeah.

> 
>>
>>> The question Christoph raised was rather why is your CPU so slow that
>>> walking the page tables has a significant overhead compared to the
>>> actual I/O?
>>
>> Yes, that's really puzzling and should be addressed first.
> With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
> is relatively low (observed in 3GHz tests).

Even on a low end CPU walking the page tables and grabbing references shouldn't be that much of an overhead.

There must be some reason why you see so much CPU overhead. E.g. compound pages are broken up or similar which should not happen in the first place.

Regards,
Christian.


> |    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1)        memfd direct R/W|      1 |    118 |    312 |   3448 | 100%
> | 2)      u+memfd direct R/W|    196 |    123 |    295 |   3651 | 105%
> | 3) u+memfd direct sendfile|    175 |    102 |    976 |   1100 |  31%
> | 4)   u+memfd direct splice|    173 |    103 |    443 |   2428 |  70%
> | 5)      udmabuf buffer R/W|    183 |    100 |    453 |   2375 |  68%
> | 6)       dmabuf buffer R/W|     34 |      4 |    427 |   2519 |  73%
> | 7)    udmabuf direct c_f_r|    200 |    102 |    278 |   3874 | 112%
> | 8)     dmabuf direct c_f_r|     36 |      5 |    269 |   4002 | 116%
> 
> With lower CPU performance (e.g., 1GHz), GUP overhead becomes more
> significant (as seen in 1GHz tests).
> |    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1)        memfd direct R/W|      2 |    393 |    969 |   1109 | 100%
> | 2)      u+memfd direct R/W|    592 |    424 |    570 |   1884 | 169%
> | 3) u+memfd direct sendfile|    587 |    356 |   2229 |    481 |  43%
> | 4)   u+memfd direct splice|    568 |    352 |    795 |   1350 | 121%
> | 5)      udmabuf buffer R/W|    597 |    343 |   1238 |    867 |  78%
> | 6)       dmabuf buffer R/W|     69 |     13 |   1128 |    952 |  85%
> | 7)    udmabuf direct c_f_r|    595 |    345 |    372 |   2889 | 260%
> | 8)     dmabuf direct c_f_r|     80 |     13 |    274 |   3929 | 354%
> 
> Regards,
> Wangtao.

Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 1 week ago
On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> >> dma_addr_t/len array now that the new DMA API supporting that has been
> >> merged.  Is there any chance the dma-buf maintainers could start to kick this
> >> off?  I'm of course happy to assist.
> 
> Work on that is already underway for some time.
> 
> Most GPU drivers already do sg_table -> DMA array conversion, I need
> to push on the remaining to clean up.

Do you have a pointer?

> >> Yes, that's really puzzling and should be addressed first.
> > With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
> > is relatively low (observed in 3GHz tests).
> 
> Even on a low end CPU walking the page tables and grabbing references
> shouldn't be that much of an overhead.

Yes.

> 
> There must be some reason why you see so much CPU overhead. E.g.
> compound pages are broken up or similar which should not happen in
> the first place.

pin_user_pages outputs an array of PAGE_SIZE (modulo offset and shorter
last length) struct pages unfortunately.  The block direct I/O
code has grown code to reassemble folios from them fairly recently
which did speed up some workloads.

Is this test using the block device or iomap direct I/O code?  What
kernel version is it run on?
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 1 week ago

> On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> > >> dma_addr_t/len array now that the new DMA API supporting that has
> > >> been merged.  Is there any chance the dma-buf maintainers could
> > >> start to kick this off?  I'm of course happy to assist.
> >
> > Work on that is already underway for some time.
> >
> > Most GPU drivers already do sg_table -> DMA array conversion, I need
> > to push on the remaining to clean up.
> 
> Do you have a pointer?
> 
> > >> Yes, that's really puzzling and should be addressed first.
> > > With high CPU performance (e.g., 3GHz), GUP (get_user_pages)
> > > overhead is relatively low (observed in 3GHz tests).
> >
> > Even on a low end CPU walking the page tables and grabbing references
> > shouldn't be that much of an overhead.
> 
> Yes.
> 
> >
> > There must be some reason why you see so much CPU overhead. E.g.
> > compound pages are broken up or similar which should not happen in the
> > first place.
> 
> pin_user_pages outputs an array of PAGE_SIZE (modulo offset and shorter
> last length) struct pages unfortunately.  The block direct I/O code has
> grown code to reassemble folios from them fairly recently which did speed
> up some workloads.
> 
> Is this test using the block device or iomap direct I/O code?  What kernel
> version is it run on?
Here's my analysis on Linux 6.6 with F2FS/iomap.

Comparing udmabuf+memfd direct read vs dmabuf direct c_f_r:
Systrace: On a high-end 3 GHz CPU, the former occupies >80% runtime vs
<20% for the latter. On a low-end 1 GHz CPU, the former becomes CPU-bound.
Perf: For the former, bio_iov_iter_get_pages/get_user_pages dominate
latency. The latter avoids this via lightweight bvec assignments.
|- 13.03% __arm64_sys_read
|-|- 13.03% f2fs_file_read_iter
|-|-|- 13.03% __iomap_dio_rw
|-|-|-|- 12.95% iomap_dio_bio_iter
|-|-|-|-|- 10.69% bio_iov_iter_get_pages
|-|-|-|-|-|- 10.53% iov_iter_extract_pages
|-|-|-|-|-|-|- 10.53% pin_user_pages_fast
|-|-|-|-|-|-|-|- 10.53% internal_get_user_pages_fast
|-|-|-|-|-|-|-|-|- 10.23% __gup_longterm_locked
|-|-|-|-|-|-|-|-|-|- 8.85% __get_user_pages
|-|-|-|-|-|-|-|-|-|-|- 6.26% handle_mm_fault
|-|-|-|-|- 1.91% iomap_dio_submit_bio
|-|-|-|-|-|- 1.64% submit_bio

|- 1.13% __arm64_sys_copy_file_range
|-|- 1.13% vfs_copy_file_range
|-|-|- 1.13% dma_buf_copy_file_range
|-|-|-|- 1.13% system_heap_dma_buf_rw_file
|-|-|-|-|- 1.13% f2fs_file_read_iter
|-|-|-|-|-|- 1.13% __iomap_dio_rw
|-|-|-|-|-|-|- 1.13% iomap_dio_bio_iter
|-|-|-|-|-|-|-|- 1.13% iomap_dio_submit_bio
|-|-|-|-|-|-|-|-|- 1.08% submit_bio

Large folios can reduce the GUP overhead, but it is still significantly
slower than the dmabuf-to-bio_vec conversion.

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months ago
On Fri, Jun 13, 2025 at 09:43:08AM +0000, wangtao wrote:
> Here's my analysis on Linux 6.6 with F2FS/iomap.

Linux 6.6 is almost two years old and completely irrelevant.  Please
provide numbers on 6.16 or current Linus' tree.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christian König 6 months, 2 weeks ago
On 6/3/25 15:19, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
>> On 6/3/25 15:00, Christoph Hellwig wrote:
>>> This is a really weird interface.  No one has yet to explain why dmabuf
>>> is so special that we can't support direct I/O to it when we can support
>>> it to otherwise exotic mappings like PCI P2P ones.
>>
>> With udmabuf you can do direct I/O, it's just inefficient to walk the
>> page tables for it when you already have an array of all the folios.
> 
> Does it matter compared to the I/O in this case?

It unfortunately does, see the numbers on patch 3 and 4.

I'm not very keen about it either, but I don't see much other way to do this.

> Either way there has been talk (in case of networking implementations)
> that use a dmabuf as a first class container for lower level I/O.
> I'd much rather do that than adding odd side interfaces.  I.e. have
> a version of splice that doesn't bother with the pipe, but instead
> just uses in-kernel direct I/O on one side and dmabuf-provided folios
> on the other.

That would work for me as well. But whether splice or copy_file_range is used is not that important to me.

My question is rather if it's ok to call f_op->write_iter() and f_op->read_iter() with pages allocated by alloc_pages(), e.g. where drivers potentially ignore the page count and just re-use pages as they like?

Regards,
Christian.

Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 2 weeks ago
On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
> > Does it matter compared to the I/O in this case?
> 
> It unfortunately does, see the numbers on patch 3 and 4.

That's kinda weird.  Why does the page table lookup take so much
time compared to normal I/O?

> My question is rather if it's ok to call f_op->write_iter() and 
> f_op->read_iter() with pages allocated by alloc_pages(), e.g.
> where drivers potentially ignore the page count and just re-use pages
> as they like?

read_iter and write_iter with ITER_BVEC just use the pages as source
and destination of the I/O.  They must not touch the refcounts or
do anything fancy with them.  Various places in the kernel rely on
that.
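
A minimal sketch of that ITER_BVEC usage (illustration only, not code from
this series; the helpers are current mainline APIs, and the actual
implementation in the patches may differ): wrap the exporter-owned pages in
a bio_vec array and hand it to f_op->read_iter(), which only uses the pages
as the I/O destination.

#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/uio.h>

/* Read 'len' bytes at 'pos' from 'filp' (opened with O_DIRECT) into pages
 * owned by the exporter, with no user mapping and no GUP. Assumes
 * nr_pages == DIV_ROUND_UP(len, PAGE_SIZE) and that direct-I/O alignment
 * requirements are already satisfied. */
static ssize_t read_into_pages(struct file *filp, struct page **pages,
			       unsigned int nr_pages, size_t len, loff_t pos)
{
	struct bio_vec *bvec;
	struct iov_iter iter;
	struct kiocb kiocb;
	unsigned int i;
	ssize_t ret;

	bvec = kcalloc(nr_pages, sizeof(*bvec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	for (i = 0; i < nr_pages; i++)
		bvec_set_page(&bvec[i], pages[i],
			      min_t(size_t, len - (size_t)i * PAGE_SIZE,
				    PAGE_SIZE), 0);

	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = pos;
	iov_iter_bvec(&iter, ITER_DEST, bvec, nr_pages, len);

	/* The pages are only borrowed as the I/O destination; their
	 * refcounts are not touched, as noted above. */
	ret = filp->f_op->read_iter(&kiocb, &iter);
	kfree(bvec);
	return ret;
}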
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christian König 6 months, 2 weeks ago
On 6/3/25 16:28, Christoph Hellwig wrote:
> On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
>>> Does it matter compared to the I/O in this case?
>>
>> It unfortunately does, see the numbers on patch 3 and 4.
> 
> That's kinda weird.  Why does the page table lookup take so much
> time compared to normal I/O?

I have absolutely no idea. It's rather surprising for me as well.

The user seems to have a rather slow CPU paired with fast I/O, but it still looks rather fishy to me.

Additional to that allocating memory through memfd_create() is *much* slower on that box than through dma-buf-heaps (which basically just uses GFP and an array).

We have seen something similar with customers systems which we couldn't explain so far.

>> My question is rather if it's ok to call f_op->write_iter() and 
>> f_op->read_iter() with pages allocated by alloc_pages(), e.g.
>> where drivers potentially ignore the page count and just re-use pages
>> as they like?
> 
> read_iter and write_iter with ITER_BVEC just use the pages as source
> and destination of the I/O.  They must not touch the refcounts or
> do anything fancy with them.  Various places in the kernel rely on
> that.

Perfect, thanks for that info.

Regards,
Christian.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 2 weeks ago
On Tue, Jun 03, 2025 at 05:55:18PM +0200, Christian König wrote:
> On 6/3/25 16:28, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
> >>> Does it matter compared to the I/O in this case?
> >>
> >> It unfortunately does, see the numbers on patch 3 and 4.
> > 
> > That's kinda weird.  Why does the page table lookup take so much
> > time compared to normal I/O?
> 
> I have absolutely no idea. It's rather surprising for me as well.
> 
> The user seems to have a rather slow CPU paired with fast I/O, but it still looks rather fishy to me.
> 
> Additional to that allocating memory through memfd_create() is *much* slower on that box than through dma-buf-heaps (which basically just uses GFP and an array).

Can someone try to reproduce these results on a normal system
before we're building infrastructure based on these numbers?
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 2 weeks ago


> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Wednesday, June 4, 2025 12:02 AM
> To: Christian König <christian.koenig@amd.com>
> Cc: Christoph Hellwig <hch@infradead.org>; wangtao
> <tao.wangtao@honor.com>; sumit.semwal@linaro.org; kraxel@redhat.com;
> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
> yipengxiang <yipengxiang@honor.com>; liulu 00013167
> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
> 
> On Tue, Jun 03, 2025 at 05:55:18PM +0200, Christian König wrote:
> > On 6/3/25 16:28, Christoph Hellwig wrote:
> > > On Tue, Jun 03, 2025 at 04:18:22PM +0200, Christian König wrote:
> > >>> Does it matter compared to the I/O in this case?
> > >>
> > >> It unfortunately does, see the numbers on patch 3 and 4.
> > >
> > > That's kinda weird.  Why does the page table lookup take so much
> > > time compared to normal I/O?
> >
> > I have absolutely no idea. It's rather surprising for me as well.
> >
> > The user seems to have a rather slow CPU paired with fast I/O, but it still
> looks rather fishy to me.
> >
> > Additional to that allocating memory through memfd_create() is *much*
> slower on that box than through dma-buf-heaps (which basically just uses
> GFP and an array).
> 
> Can someone try to reproduce these results on a normal system before
> we're building infrastructure based on these numbers?

Here's my test program. If anyone's interested,
please help test it.

Regards,
Wangtao.

[PATCH] Add dmabuf direct I/O zero-copy test program

Compare latency and throughput of file read/write for
memfd, udmabuf+memfd, udmabuf, and dmabuf buffers.
memfd supports buffered and direct I/O via the read/write,
sendfile, and splice user APIs.
udmabuf/dmabuf only support buffered I/O via read/write,
lacking direct I/O, sendfile, and splice support.
The previous patches added a copy_file_range callback for
dmabuf, enabling buffered/direct I/O file copies for
udmabuf/dmabuf.
u+memfd means using a memfd-created udmabuf with the memfd's
user APIs for file copying.

usage: dmabuf-dio [file_path] [size_MB]

Signed-off-by: wangtao <tao.wangtao@honor.com>
---
 tools/testing/selftests/dmabuf-heaps/Makefile |   1 +
 .../selftests/dmabuf-heaps/dmabuf-dio.c       | 617 ++++++++++++++++++
 2 files changed, 618 insertions(+)
 create mode 100644 tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c

diff --git a/tools/testing/selftests/dmabuf-heaps/Makefile b/tools/testing/selftests/dmabuf-heaps/Makefile
index 9e7e158d5fa3..beb6b3e55e17 100644
--- a/tools/testing/selftests/dmabuf-heaps/Makefile
+++ b/tools/testing/selftests/dmabuf-heaps/Makefile
@@ -2,5 +2,6 @@
 CFLAGS += -static -O3 -Wl,-no-as-needed -Wall $(KHDR_INCLUDES)
 
 TEST_GEN_PROGS = dmabuf-heap
+TEST_GEN_PROGS += dmabuf-dio
 
 include ../lib.mk
diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c
new file mode 100644
index 000000000000..eae902a27f29
--- /dev/null
+++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-dio.c
@@ -0,0 +1,617 @@
+#include <linux/dma-heap.h>
+#include <linux/dma-buf.h>
+#include <linux/udmabuf.h>
+#include <sys/mman.h>
+#include <sys/sendfile.h>
+#include <sys/ioctl.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <asm/unistd.h>
+#include <time.h>
+#include <errno.h>
+
+#ifndef TEMP_FAILURE_RETRY
+#define TEMP_FAILURE_RETRY(exp) ({         \
+    __typeof__(exp) _rc;                   \
+    do {                                   \
+        _rc = (exp);                       \
+    } while (_rc == -1 && errno == EINTR); \
+    _rc; })
+#endif
+
+#if 1
+int memfd_create(const char *name, unsigned flags)
+{
+    return syscall(__NR_memfd_create, name, flags);
+}
+
+ssize_t copy_file_range(int fd_in, off_t *off_in, int fd_out, off_t *off_out,
+                size_t len, unsigned flags)
+{
+    return syscall(__NR_copy_file_range, fd_in, off_in, fd_out, off_out, len, flags);
+}
+#endif
+
+int alloc_memfd(size_t size)
+{
+    int memfd = memfd_create("ubuf", MFD_ALLOW_SEALING);
+    if (memfd < 0)
+        return -1;
+
+    int ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+    if (ret < 0)
+        return -1;
+    ret = TEMP_FAILURE_RETRY(ftruncate(memfd, size));
+    if (ret < 0)
+        return -1;
+    return memfd;
+}
+
+int alloc_udmabuf(size_t size, int memfd)
+{
+    static int udev_fd = -1;
+    if (udev_fd < 0) {
+        udev_fd = open("/dev/udmabuf", O_RDONLY);
+        if (udev_fd < 0)
+            return -1;
+    }
+
+    struct udmabuf_create uc = {0};
+    uc.memfd = memfd;
+    uc.offset = 0;
+    uc.size = size;
+    int buf_fd = TEMP_FAILURE_RETRY(ioctl(udev_fd, UDMABUF_CREATE, &uc));
+    if (buf_fd < 0)
+        return -1;
+
+    return buf_fd;
+}
+
+int alloc_dmabuf(size_t size)
+{
+    static int heap_fd = -1;
+
+    struct dma_heap_allocation_data heap_data = { 0 };
+    heap_data.len = size;  // length of data to be allocated in bytes
+    heap_data.fd_flags = O_RDWR | O_CLOEXEC;  // permissions for the memory to be allocated
+
+    if (heap_fd < 0) {
+        heap_fd = open("/dev/dma_heap/system", O_RDONLY);
+        if (heap_fd < 0)
+            return -1;
+    }
+
+    int ret = TEMP_FAILURE_RETRY(ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &heap_data));
+    if (ret < 0) {
+        return -1;
+    }
+    if (heap_data.fd < 0)
+        return -1;
+
+    return heap_data.fd;
+}
+
+static inline long times_us_duration(struct timespec *ts_start, struct timespec *ts_end)
+{
+    long long start = ts_start->tv_sec * 1000000 + ts_start->tv_nsec / 1000;
+    long long end = ts_end->tv_sec * 1000000 + ts_end->tv_nsec / 1000;
+    return end - start;
+}
+
+static inline long time_us2ms(long us)
+{
+        return (us + 1000 - 1) / 1000;
+}
+
+void drop_pagecaches(int file_fd, loff_t offset, size_t len)
+{
+    if (file_fd >= 0 && len > 0) {
+        posix_fadvise(file_fd, offset, len, POSIX_FADV_DONTNEED);
+        return;
+    }
+
+    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY | O_CLOEXEC);
+    if (fd < 0) {
+        printf("open drop_caches failed %d\n", errno);
+        return;
+    }
+    write(fd, "3", 1);
+    close(fd);
+}
+
+const size_t SIZ_MB = 1024 * 1024;
+const size_t DMABUF_SIZE_MAX = SIZ_MB * 32;
+
+static inline unsigned char test_data_value(unsigned int val)
+{
+    return val % 253;
+}
+
+void test_fill_data(unsigned char* ptr, unsigned int val, size_t sz, bool fast)
+{
+    if (sz > 0 && fast) {
+        ptr[0] = test_data_value(val);
+        ptr[sz / 2] = test_data_value(val + sz / 2);
+        ptr[sz - 1] = test_data_value(val + sz - 1);
+        return;
+    }
+    for (size_t i = 0; i < sz; i++) {
+        ptr[i] = test_data_value(val + i);
+    }
+}
+
+bool test_check_data(unsigned char* ptr, unsigned int val, size_t sz, bool fast)
+{
+    if (sz > 0 && fast) {
+        if (ptr[0] != test_data_value(val))
+            return false;
+        if (ptr[sz / 2] != test_data_value(val + sz / 2))
+            return false;
+        if (ptr[sz - 1] != test_data_value(val + sz - 1))
+            return false;
+        return true;
+    }
+    for (size_t i = 0; i < sz; i++) {
+        if (ptr[i] != test_data_value(val + i))
+            return false;
+    }
+    return true;
+}
+
+enum mem_buf_type {
+    BUF_MEMFD,
+    BUF_UDMA_MEMFD,
+    BUF_UDMABUF,
+    BUF_DMABUF,
+    BUF_TYPE_MAX,
+};
+
+enum copy_io_type {
+    IO_MAP_READ_WRITE,
+    IO_SENDFILE,
+    IO_SPLICE,
+    IO_COPY_FILE_RANGE,
+    IO_TYPE_MAX,
+};
+
+static const char *mem_buf_type_descs[BUF_TYPE_MAX] = {
+    "memfd", "u+memfd", "udmabuf", "dmabuf",
+};
+
+static const char *io_type_descs[IO_TYPE_MAX] = {
+    "R/W", "sendfile", "splice", "c_f_r",
+};
+
+struct mem_buf_st {
+    enum mem_buf_type buf_type_;
+    int io_fd_;
+    int mem_fd_;
+    int buf_fd_;
+    size_t buf_len_;
+    unsigned char *buf_ptr_;
+};
+
+struct mem_buf_tc {
+    enum mem_buf_type buf_type_;
+    enum copy_io_type io_type_;
+    int file_fd_;
+    bool direct_io_;
+    size_t io_len_;
+    long times_create_;
+    long times_data_;
+    long times_io_;
+    long times_close_;
+};
+
+void membuf_deinit(struct mem_buf_st *membuf)
+{
+    if (membuf->buf_ptr_ != NULL && membuf->buf_ptr_ != MAP_FAILED)
+        munmap(membuf->buf_ptr_, membuf->buf_len_);
+    membuf->buf_ptr_ = NULL;
+    if (membuf->mem_fd_ > 0)
+       close(membuf->mem_fd_);
+    if (membuf->buf_fd_ > 0)
+       close(membuf->buf_fd_);
+    membuf->mem_fd_ = -1;
+    membuf->buf_fd_ = -1;
+}
+
+bool membuf_init(struct mem_buf_st *membuf, size_t len, enum mem_buf_type buf_type)
+{
+    int map_fd = -1;
+
+    membuf->mem_fd_ = -1;
+    membuf->buf_fd_ = -1;
+    membuf->buf_len_ = len;
+    membuf->buf_ptr_ = NULL;
+    if (buf_type <= BUF_UDMABUF) {
+        membuf->mem_fd_ = alloc_memfd(len);
+        if (membuf->mem_fd_ < 0) {
+            printf("alloc memfd %zd failed %d\n", len, errno);
+            return false;
+        }
+        map_fd = membuf->mem_fd_;
+        if (buf_type > BUF_MEMFD) {
+            membuf->buf_fd_ = alloc_udmabuf(len, membuf->mem_fd_);
+            if (membuf->buf_fd_ < 0) {
+                printf("alloc udmabuf %zd failed %d\n", len, errno);
+                return false;
+            }
+            if (buf_type == BUF_UDMABUF)
+                map_fd = membuf->buf_fd_;
+        }
+    } else {
+        membuf->buf_fd_ = alloc_dmabuf(len);
+        if (membuf->buf_fd_ < 0) {
+            printf("alloc dmabuf %zd failed %d\n", len, errno);
+            return false;
+        }
+        map_fd = membuf->buf_fd_;
+    }
+    membuf->io_fd_ = map_fd;
+    membuf->buf_ptr_ = (unsigned char *)mmap(NULL, len,
+            PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
+    if (membuf->buf_ptr_ == MAP_FAILED) {
+        printf("fd %d map %zd failed %d\n", map_fd, len, errno);
+        membuf->buf_ptr_ = NULL;
+        return false;
+    }
+    return true;
+}
+
+ssize_t membuf_read_write(const struct mem_buf_st *membuf, int file_fd,
+                loff_t off, bool is_read)
+{
+    if (!membuf->buf_ptr_)
+        return -1;
+    lseek(file_fd, off, SEEK_SET);
+    if (is_read)
+        return read(file_fd, membuf->buf_ptr_, membuf->buf_len_);
+    else
+        return write(file_fd, membuf->buf_ptr_, membuf->buf_len_);
+}
+
+ssize_t membuf_sendfile(const struct mem_buf_st *membuf, int file_fd,
+                        loff_t off, bool is_read)
+{
+    int mem_fd = membuf->io_fd_;
+    size_t buf_len = membuf->buf_len_;
+
+    if (mem_fd < 0)
+        return -__LINE__;
+
+    lseek(mem_fd, 0, SEEK_SET);
+    lseek(file_fd, off, SEEK_SET);
+    if (is_read)
+        return sendfile(mem_fd, file_fd, NULL, buf_len);
+    else
+        return sendfile(file_fd, mem_fd, NULL, buf_len);
+}
+
+ssize_t membuf_splice(const struct mem_buf_st *membuf, int file_fd,
+                        loff_t off, bool is_read)
+{
+    size_t len = 0, out_len = 0, buf_len = membuf->buf_len_;
+    int mem_fd = membuf->io_fd_;
+    int fd_in = file_fd, fd_out = mem_fd;
+    ssize_t ret = 0;
+    static int s_pipe_fds[2] = { -1, -1};
+
+    if (mem_fd < 0)
+        return -__LINE__;
+
+    lseek(mem_fd, 0, SEEK_SET);
+    lseek(file_fd, off, SEEK_SET);
+    if (s_pipe_fds[0] < 0) {
+        const int pipe_size = SIZ_MB * 32;
+        int pipe_fds[2];
+        ret = pipe(pipe_fds);
+        if (ret < 0)
+            return -__LINE__;
+        ret = fcntl(pipe_fds[1], F_SETPIPE_SZ, pipe_size);
+        if (ret < 0)
+            return -__LINE__;
+        ret = fcntl(pipe_fds[0], F_GETPIPE_SZ, pipe_size);
+        if (ret != pipe_size)
+            return -__LINE__;
+        s_pipe_fds[0] = pipe_fds[0];
+        s_pipe_fds[1] = pipe_fds[1];
+    }
+
+    if (!is_read) {
+        fd_in = mem_fd;
+        fd_out = file_fd;
+    }
+
+    while (buf_len > len) {
+        ret = splice(fd_in, NULL, s_pipe_fds[1], NULL, buf_len - len, SPLICE_F_NONBLOCK);
+        if (ret <= 0)
+            break;
+        len += ret;
+        do {
+            ret = splice(s_pipe_fds[0], NULL, fd_out, NULL, len - out_len, 0);
+            if (ret <= 0)
+                break;
+            out_len += ret;
+        } while (out_len < len);
+    }
+    return out_len > 0 ? out_len : ret;
+}
+
+ssize_t membuf_cfr(const struct mem_buf_st *membuf, int file_fd, loff_t off,
+                        bool is_read)
+{
+    loff_t mem_pos = 0;
+    loff_t file_pos = off;
+    size_t out_len = 0, buf_len = membuf->buf_len_;
+    int mem_fd = membuf->io_fd_;
+    int fd_in = file_fd, fd_out = mem_fd;
+    loff_t pos_in = file_pos, pos_out = mem_pos;
+    ssize_t ret = 0;
+
+    if (mem_fd < 0)
+        return -__LINE__;
+
+    lseek(mem_fd, mem_pos, SEEK_SET);
+    lseek(file_fd, file_pos, SEEK_SET);
+
+    if (!is_read) {
+        fd_in = mem_fd;
+        fd_out = file_fd;
+        pos_in = mem_pos;
+        pos_out = file_pos;
+    }
+
+    while (buf_len > out_len) {
+        ret = copy_file_range(fd_in, &pos_in, fd_out, &pos_out, buf_len - out_len, 0);
+        if (ret <= 0)
+            break;
+        out_len += ret;
+    }
+    return out_len > 0 ? out_len : ret;
+}
+
+ssize_t membuf_io(const struct mem_buf_st *membuf, int file_fd, loff_t off,
+                        bool is_read, enum copy_io_type io_type)
+{
+    ssize_t ret = 0;
+    if (io_type == IO_MAP_READ_WRITE) {
+        ret = membuf_read_write(membuf, file_fd, off, is_read);
+    } else if (io_type == IO_SENDFILE) {
+        ret = membuf_sendfile(membuf, file_fd, off, is_read);
+    } else if (io_type == IO_SPLICE) {
+        ret = membuf_splice(membuf, file_fd, off, is_read);
+    } else if (io_type == IO_COPY_FILE_RANGE) {
+        ret = membuf_cfr(membuf, file_fd, off, is_read);
+    } else
+        return -1;
+    if (ret < 0)
+        printf("membuf_io io failed %d\n", errno);
+    return ret;
+}
+
+const char *membuf_tc_desc(const struct mem_buf_tc *tc)
+{
+    static char buf[32];
+    snprintf(buf, sizeof(buf), "%s %s %s", mem_buf_type_descs[tc->buf_type_],
+        tc->direct_io_ ? "direct" : "buffer", io_type_descs[tc->io_type_]);
+    return buf;
+}
+
+bool test_membuf(struct mem_buf_tc *tc, loff_t pos, size_t file_len,
+                        size_t buf_siz, bool is_read, bool clean_pagecaches)
+{
+    loff_t off = pos, file_end = pos + file_len;
+    int file_fd = tc->file_fd_;
+    int i = 0, buf_num;
+    struct mem_buf_st *membufs;
+    struct timespec ts_start, ts_end;
+    ssize_t ret;
+
+    if (buf_siz > file_len)
+        buf_siz = file_len;
+    buf_num = (file_len + buf_siz - 1) / buf_siz;
+    membufs = (struct mem_buf_st *)malloc(sizeof(*membufs) * buf_num);
+    if (!membufs)
+        return false;
+
+    memset(membufs, 0, sizeof(*membufs) * buf_num);
+    drop_pagecaches(-1, 0, 0);
+    for (i = 0; i < buf_num && off < file_end; i++, off += buf_siz) {
+        if (buf_siz > file_end - off)
+            buf_siz = file_end - off;
+
+        if (clean_pagecaches)
+            drop_pagecaches(file_fd, off, buf_siz);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        if (!membuf_init(&membufs[i], buf_siz, tc->buf_type_)) {
+             printf("alloc %s %d failed\n", membuf_tc_desc(tc), i);
+             break;
+        }
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_create_ += times_us_duration(&ts_start, &ts_end);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        if (!membufs[i].buf_ptr_) {
+            printf("map %s %d failed\n", membuf_tc_desc(tc), i);
+            break;
+        }
+        if (!is_read)
+            test_fill_data(membufs[i].buf_ptr_, off + 1, buf_siz, true);
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_data_ += times_us_duration(&ts_start, &ts_end);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        ret = membuf_io(&membufs[i], file_fd, off, is_read, tc->io_type_);
+        if (ret < 0 || ret != buf_siz) {
+            printf("membuf_io %s %d rw %zd ret %zd failed %d\n",
+                membuf_tc_desc(tc), i, buf_siz, ret, errno);
+            break;
+        }
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_io_ += times_us_duration(&ts_start, &ts_end);
+
+        clock_gettime(CLOCK_MONOTONIC, &ts_start);
+        if (!test_check_data(membufs[i].buf_ptr_, off + 1, buf_siz, true)) {
+            printf("check data %s %d failed\n", membuf_tc_desc(tc), i);
+            break;
+        }
+        clock_gettime(CLOCK_MONOTONIC, &ts_end);
+        tc->times_data_ += times_us_duration(&ts_start, &ts_end);
+
+        if (clean_pagecaches)
+            drop_pagecaches(file_fd, off, buf_siz);
+    }
+
+    clock_gettime(CLOCK_MONOTONIC, &ts_start);
+    for (i = 0; i < buf_num; i++) {
+        membuf_deinit(&membufs[i]);
+    }
+    clock_gettime(CLOCK_MONOTONIC, &ts_end);
+    tc->times_close_ += times_us_duration(&ts_start, &ts_end);
+    drop_pagecaches(-1, 0, 0);
+    tc->io_len_ = off - pos;
+    return off - pos >= file_end;
+}
+
+bool prepare_init_file(int file_fd, loff_t off, size_t file_len)
+{
+    struct mem_buf_st membuf = {};
+    ssize_t ret;
+
+    ftruncate(file_fd, off + file_len);
+
+    if (!membuf_init(&membuf, file_len, BUF_MEMFD))
+        return false;
+
+    test_fill_data(membuf.buf_ptr_, off + 1, file_len, false);
+    ret = membuf_io(&membuf, file_fd, off, false, IO_MAP_READ_WRITE);
+    membuf_deinit(&membuf);
+
+    return ret >= file_len;
+}
+
+bool prepare_file(const char *filepath, loff_t off, size_t file_len)
+{
+    ssize_t file_end;
+    bool suc = true;
+    int flags = O_RDWR | O_CLOEXEC | O_LARGEFILE | O_CREAT;
+    int file_fd = open(filepath, flags, 0660);
+    if (file_fd < 0) {
+        printf("open %s failed %d\n", filepath, errno);
+        return false;
+    }
+
+    file_end = (size_t)lseek(file_fd, 0, SEEK_END);
+    if (file_end < off + file_len)
+        suc = prepare_init_file(file_fd, off, file_len);
+
+    close(file_fd);
+    return suc;
+}
+
+void test_membuf_cases(struct mem_buf_tc *test_cases, int test_count,
+                loff_t off, size_t file_len, size_t buf_siz,
+                bool is_read, bool clean_pagecaches)
+{
+    char title[64];
+    long file_MB = (file_len + SIZ_MB - 1) / SIZ_MB;
+    long buf_MB = (buf_siz + SIZ_MB - 1) / SIZ_MB;
+    long base_io_time, io_times;
+    long base_io_speed, io_speed;
+    struct mem_buf_tc *tc;
+    int n;
+
+    for (n = 0; n < test_count; n++) {
+        test_membuf(&test_cases[n], off, file_len, buf_siz, is_read, clean_pagecaches);
+    }
+
+    snprintf(title, sizeof(title), "%ldx%ldMB %s %4ldMB",
+             (file_MB + buf_MB - 1) / buf_MB, buf_MB,
+             is_read ? "Read" : "Write", file_MB);
+
+    base_io_time = test_cases[0].times_io_ ?: 1;
+    base_io_speed = test_cases[0].io_len_ / base_io_time ? : 1000;
+
+    printf("|    %-23s|%8s|%8s|%8s|%8s| I/O%%\n", title,
+           "Creat-ms", "Close-ms", "I/O-ms", "I/O-MB/s");
+    printf("|---------------------------|--------|--------|--------|--------|-----\n");
+    for (n = 0; n < test_count; n++) {
+        tc = &test_cases[n];
+        io_times = tc->times_io_;
+        io_speed = io_times > 0 ? tc->io_len_ / io_times : 0;
+
+        printf("|%2d) %23s| %6ld | %6ld | %6ld | %6ld |%4ld%%\n", n + 1,
+               membuf_tc_desc(tc), time_us2ms(tc->times_create_),
+               time_us2ms(tc->times_close_), time_us2ms(io_times),
+               io_speed, io_speed * 100 / base_io_speed);
+    }
+    return;
+}
+
+void test_all_membufs(const char *filepath, loff_t off, size_t file_len, size_t buf_siz,
+                        bool is_read, bool clean_pagecaches)
+{
+    int buffer_fd;
+    int direct_fd;
+
+    buffer_fd = open(filepath, O_RDWR | O_CLOEXEC | O_LARGEFILE);
+    direct_fd = open(filepath, O_RDWR | O_CLOEXEC | O_LARGEFILE | O_DIRECT);
+    if (buffer_fd < 0 || direct_fd < 0) {
+        printf("buffer_fd %d direct_fd %d\n", buffer_fd, direct_fd);
+        return;
+    }
+
+    struct mem_buf_tc test_cases[] = {
+        {BUF_MEMFD, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_MEMFD, IO_MAP_READ_WRITE, direct_fd, true},
+        {BUF_UDMA_MEMFD, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_UDMA_MEMFD, IO_MAP_READ_WRITE, direct_fd, true},
+        {BUF_UDMA_MEMFD, IO_SENDFILE, buffer_fd, false},
+        {BUF_UDMA_MEMFD, IO_SENDFILE, direct_fd, true},
+        {BUF_UDMA_MEMFD, IO_SPLICE, buffer_fd, false},
+        {BUF_UDMA_MEMFD, IO_SPLICE, direct_fd, true},
+        {BUF_UDMABUF, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_DMABUF, IO_MAP_READ_WRITE, buffer_fd, false},
+        {BUF_UDMABUF, IO_COPY_FILE_RANGE, buffer_fd, false},
+        {BUF_UDMABUF, IO_COPY_FILE_RANGE, direct_fd, true},
+        {BUF_DMABUF, IO_COPY_FILE_RANGE, buffer_fd, false},
+        {BUF_DMABUF, IO_COPY_FILE_RANGE, direct_fd, true},
+    };
+
+    test_membuf_cases(test_cases, sizeof(test_cases) / sizeof(test_cases[0]),
+            off, file_len, buf_siz, is_read, clean_pagecaches);
+    close(buffer_fd);
+    close(direct_fd);
+}
+
+void usage(void)
+{
+    printf("usage: dmabuf-dio [file_path] [size_MB]\n");
+    return;
+}
+
+int main(int argc, char *argv[])
+{
+    const char *file_path = "/data/membuf.tmp";
+    size_t file_len = SIZ_MB * 1024;
+
+    if (argc > 1)
+        file_path = argv[1];
+    if (argc > 2)
+        file_len = atoi(argv[2]) * SIZ_MB;
+    if (file_len < 0)
+        file_len = SIZ_MB * 1024;
+    if (!prepare_file(file_path, 0, file_len)) {
+        usage();
+        return -1;
+    }
+
+    test_all_membufs(file_path, 0, file_len, SIZ_MB * 32, true, true);        
+    return 0;
+}
--