RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range

wangtao posted 4 patches 6 months, 2 weeks ago
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 2 weeks ago

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, June 3, 2025 9:20 PM
> To: Christian König <christian.koenig@amd.com>
> Cc: Christoph Hellwig <hch@infradead.org>; wangtao
> <tao.wangtao@honor.com>; sumit.semwal@linaro.org; kraxel@redhat.com;
> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
> yipengxiang <yipengxiang@honor.com>; liulu 00013167
> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
> 
> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
> > On 6/3/25 15:00, Christoph Hellwig wrote:
> > > This is a really weird interface.  No one has yet explained why
> > > dmabuf is so special that we can't support direct I/O to it when we
> > > can support it to otherwise exotic mappings like PCI P2P ones.
> >
> > With udmabuf you can do direct I/O, it's just inefficient to walk the
> > page tables for it when you already have an array of all the folios.
> 
> Does it matter compared to the I/O in this case?
> 
> Either way, there has been talk (in the case of networking implementations)
> of using a dmabuf as a first-class container for lower-level I/O.
> I'd much rather do that than add odd side interfaces, i.e. have a version
> of splice that doesn't bother with the pipe, but instead just uses in-kernel
> direct I/O on one side and dmabuf-provided folios on the other.
If the VFS layer recognizes the dmabuf type and acquires its sg_table
and folios, zero-copy could also be achieved. I initially thought
dmabuf acts as a driver and shouldn't be handled by VFS, so I made
dmabuf implement copy_file_range callbacks to support direct I/O
zero-copy. I'm open to both approaches. What's the preference of
VFS experts?
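
For reference, a minimal userspace sketch of what this interface looks
like from the application side (the dma-heap allocation ioctl is
existing upstream API; the copy_file_range() call between an O_DIRECT
file and a dmabuf fd is the part this series would enable, so treat it
as illustrative rather than final):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

int main(void)
{
        struct dma_heap_allocation_data alloc = {
                .len = 32 << 20,                /* one 32 MB buffer */
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        int heap_fd, file_fd;
        ssize_t ret;

        heap_fd = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0 || ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0)
                return 1;

        file_fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (file_fd < 0)
                return 1;

        /* Read the file directly into the dmabuf, no CPU bounce copy. */
        ret = copy_file_range(file_fd, NULL, alloc.fd, NULL, alloc.len, 0);
        printf("copied %zd bytes\n", ret);
        return ret < 0;
}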

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christian König 6 months, 2 weeks ago
On 6/6/25 11:52, wangtao wrote:
> 
> 
>> -----Original Message-----
>> From: Christoph Hellwig <hch@infradead.org>
>> Sent: Tuesday, June 3, 2025 9:20 PM
>> To: Christian König <christian.koenig@amd.com>
>> Cc: Christoph Hellwig <hch@infradead.org>; wangtao
>> <tao.wangtao@honor.com>; sumit.semwal@linaro.org; kraxel@redhat.com;
>> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
>> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
>> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
>> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
>> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
>> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
>> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
>> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
>> yipengxiang <yipengxiang@honor.com>; liulu 00013167
>> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
>> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
>> copy_file_range
>>
>> On Tue, Jun 03, 2025 at 03:14:20PM +0200, Christian König wrote:
>>> On 6/3/25 15:00, Christoph Hellwig wrote:
> > >> This is a really weird interface.  No one has yet explained why
>>>> dmabuf is so special that we can't support direct I/O to it when we
>>>> can support it to otherwise exotic mappings like PCI P2P ones.
>>>
>>> With udmabuf you can do direct I/O, it's just inefficient to walk the
>>> page tables for it when you already have an array of all the folios.
>>
>> Does it matter compared to the I/O in this case?
>>
>> Either way, there has been talk (in the case of networking implementations)
>> of using a dmabuf as a first-class container for lower-level I/O.
>> I'd much rather do that than add odd side interfaces, i.e. have a version
>> of splice that doesn't bother with the pipe, but instead just uses in-kernel
>> direct I/O on one side and dmabuf-provided folios on the other.
> If the VFS layer recognizes the dmabuf type and acquires its sg_table
> and folios, zero-copy could also be achieved. I initially thought
> dmabuf acts as a driver and shouldn't be handled by VFS, so I made
> dmabuf implement copy_file_range callbacks to support direct I/O
> zero-copy. I'm open to both approaches. What's the preference of
> VFS experts?

That would probably be illegal. Using the sg_table in the DMA-buf implementation turned out to be a mistake.

The question Christoph raised was rather why is your CPU so slow that walking the page tables has a significant overhead compared to the actual I/O?

Regards,
Christian.

> 
> Regards,
> Wangtao.
> 

Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 1 week ago
On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
> > dmabuf acts as a driver and shouldn't be handled by VFS, so I made
> > dmabuf implement copy_file_range callbacks to support direct I/O
> > zero-copy. I'm open to both approaches. What's the preference of
> > VFS experts?
> 
> That would probably be illegal. Using the sg_table in the DMA-buf
> implementation turned out to be a mistake.

Two things here that should not be directly conflated.  Using the
sg_table was a huge mistake, and we should try to switch dmabuf to a
pure dma_addr_t/len array now that the new DMA API
supporting that has been merged.  Is there any chance the dma-buf
maintainers could start to kick this off?  I'm of course happy to
assist.

But that notwithstanding, dma-buf is THE buffer sharing mechanism in
the kernel, and we should promote it instead of reinventing it badly.
And there is a use case for having a fully DMA mapped buffer in the
block layer and I/O path, especially on systems with an IOMMU.
So having an iov_iter backed by a dma-buf would be extremely helpful.
That's mostly lib/iov_iter.c code, not VFS, though.
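
To illustrate that direction, a rough sketch only: the flat descriptor
type and helper below are made-up names, while the scatterlist
iteration helpers are existing kernel API.

#include <linux/scatterlist.h>
#include <linux/slab.h>

struct dma_vec {                        /* hypothetical flat descriptor */
        dma_addr_t addr;
        unsigned int len;
};

static struct dma_vec *dma_vec_from_sgt(struct sg_table *sgt, unsigned int *nr)
{
        struct dma_vec *vec;
        struct scatterlist *sg;
        unsigned int i;

        vec = kcalloc(sgt->nents, sizeof(*vec), GFP_KERNEL);
        if (!vec)
                return NULL;

        /* Walk only the DMA-mapped entries; never touch sg_page(). */
        for_each_sgtable_dma_sg(sgt, sg, i) {
                vec[i].addr = sg_dma_address(sg);
                vec[i].len = sg_dma_len(sg);
        }

        *nr = sgt->nents;
        return vec;
}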

> The question Christoph raised was rather why is your CPU so slow
> that walking the page tables has a significant overhead compared to
> the actual I/O?

Yes, that's really puzzling and should be addressed first.
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 1 week ago

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Monday, June 9, 2025 12:35 PM
> To: Christian König <christian.koenig@amd.com>
> Cc: wangtao <tao.wangtao@honor.com>; Christoph Hellwig
> <hch@infradead.org>; sumit.semwal@linaro.org; kraxel@redhat.com;
> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
> yipengxiang <yipengxiang@honor.com>; liulu 00013167
> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
> copy_file_range
> 
> On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
> > > dmabuf acts as a driver and shouldn't be handled by VFS, so I made
> > > dmabuf implement copy_file_range callbacks to support direct I/O
> > > zero-copy. I'm open to both approaches. What's the preference of VFS
> > > experts?
> >
> > That would probably be illegal. Using the sg_table in the DMA-buf
> > implementation turned out to be a mistake.
> 
> Two things here that should not be directly conflated.  Using the sg_table was
> a huge mistake, and we should try to switch dmabuf to a pure
I'm a bit confused: don't dmabuf importers need to traverse the sg_table to
access folios or dma_addr/len? Do you mean restricting sg_table access
(e.g., only via iov_iter) or proposing alternative approaches?

> dma_addr_t/len array now that the new DMA API supporting that has been
> merged.  Is there any chance the dma-buf maintainers could start to kick this
> off?  I'm of course happy to assist.
> 
> But that notwithstanding, dma-buf is THE buffer sharing mechanism in the
> kernel, and we should promote it instead of reinventing it badly.
> And there is a use case for having a fully DMA mapped buffer in the block
> layer and I/O path, especially on systems with an IOMMU.
> So having an iov_iter backed by a dma-buf would be extremely helpful.
> That's mostly lib/iov_iter.c code, not VFS, though.
Are you suggesting adding an ITER_DMABUF type to iov_iter, or
implementing dmabuf-to-iov_bvec conversion within iov_iter?

> 
> > The question Christoph raised was rather why is your CPU so slow that
> > walking the page tables has a significant overhead compared to the
> > actual I/O?
> 
> Yes, that's really puzzling and should be addressed first.
With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
is relatively low (observed in 3GHz tests).
|    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|---------------------------|--------|--------|--------|--------|-----
| 1)        memfd direct R/W|      1 |    118 |    312 |   3448 | 100%
| 2)      u+memfd direct R/W|    196 |    123 |    295 |   3651 | 105%
| 3) u+memfd direct sendfile|    175 |    102 |    976 |   1100 |  31%
| 4)   u+memfd direct splice|    173 |    103 |    443 |   2428 |  70%
| 5)      udmabuf buffer R/W|    183 |    100 |    453 |   2375 |  68%
| 6)       dmabuf buffer R/W|     34 |      4 |    427 |   2519 |  73%
| 7)    udmabuf direct c_f_r|    200 |    102 |    278 |   3874 | 112%
| 8)     dmabuf direct c_f_r|     36 |      5 |    269 |   4002 | 116%

With lower CPU performance (e.g., 1GHz), GUP overhead becomes more
significant (as seen in 1GHz tests).
|    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
|---------------------------|--------|--------|--------|--------|-----
| 1)        memfd direct R/W|      2 |    393 |    969 |   1109 | 100%
| 2)      u+memfd direct R/W|    592 |    424 |    570 |   1884 | 169%
| 3) u+memfd direct sendfile|    587 |    356 |   2229 |    481 |  43%
| 4)   u+memfd direct splice|    568 |    352 |    795 |   1350 | 121%
| 5)      udmabuf buffer R/W|    597 |    343 |   1238 |    867 |  78%
| 6)       dmabuf buffer R/W|     69 |     13 |   1128 |    952 |  85%
| 7)    udmabuf direct c_f_r|    595 |    345 |    372 |   2889 | 260%
| 8)     dmabuf direct c_f_r|     80 |     13 |    274 |   3929 | 354%

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 1 week ago
On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> Are you suggesting adding an ITER_DMABUF type to iov_iter,

Yes.
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 1 week ago
> 
> On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > Are you suggesting adding an ITER_DMABUF type to iov_iter,
> 
> Yes.

May I clarify: do all disk operations require data to pass through
memory (reading into memory or writing from memory)? In the block
layer, bio_iov_iter_get_pages() converts the various iter_type objects
into memory-backed bio_vec entries.
However, some dmabufs are not backed by memory, making page-to-bio_vec
conversion impossible. This suggests adding a callback in dma_buf_ops
to handle dmabuf-to-bio_vec conversion.

Interestingly, if such a callback exists, the need for a dedicated
ITER_DMABUF type might disappear. Would you like to discuss potential
implementation tradeoffs here?
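
For a memory-backed heap such a conversion could look roughly like the
sketch below (illustrative only: the entry point and the way the page
array is obtained are hypothetical, while bvec_set_page(),
iov_iter_bvec() and vfs_iocb_iter_read() are existing kernel APIs).
A device-memory dmabuf has no struct pages to put in the bvec, which is
exactly the limitation above.

#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/uio.h>

static ssize_t heap_read_file_into_pages(struct file *file, loff_t pos,
                                         struct page **pages,
                                         unsigned int nr_pages)
{
        struct bio_vec *bvec;
        struct iov_iter iter;
        struct kiocb kiocb;
        unsigned int i;
        ssize_t ret;

        bvec = kcalloc(nr_pages, sizeof(*bvec), GFP_KERNEL);
        if (!bvec)
                return -ENOMEM;

        /* Wrap the heap's pages directly; no page-table walk, no pinning. */
        for (i = 0; i < nr_pages; i++)
                bvec_set_page(&bvec[i], pages[i], PAGE_SIZE, 0);

        iov_iter_bvec(&iter, ITER_DEST, bvec, nr_pages,
                      (size_t)nr_pages * PAGE_SIZE);

        init_sync_kiocb(&kiocb, file);
        kiocb.ki_pos = pos;
        kiocb.ki_flags |= IOCB_DIRECT;  /* ask for direct I/O */

        ret = vfs_iocb_iter_read(file, &kiocb, &iter);

        kfree(bvec);
        return ret;
}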

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months ago
On Fri, Jun 13, 2025 at 09:33:55AM +0000, wangtao wrote:
> 
> > 
> > On Mon, Jun 09, 2025 at 09:32:20AM +0000, wangtao wrote:
> > > Are you suggesting adding an ITER_DMABUF type to iov_iter,
> > 
> > Yes.
> 
> May I clarify: do all disk operations require data to pass through
> memory (reading into memory or writing from memory)? In the block
> layer, bio_iov_iter_get_pages() converts the various iter_type objects
> into memory-backed bio_vec entries.
> However, some dmabufs are not backed by memory, making page-to-bio_vec
> conversion impossible. This suggests adding a callback in dma_buf_ops
> to handle dmabuf-to-bio_vec conversion.

bios do support PCI P2P transfers.  This could be fairly easily extended
to other peer-to-peer transfers if we manage to come up with a coherent
model for them.  No need for a callback.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christian König 6 months, 1 week ago
On 6/9/25 11:32, wangtao wrote:
>> -----Original Message-----
>> From: Christoph Hellwig <hch@infradead.org>
>> Sent: Monday, June 9, 2025 12:35 PM
>> To: Christian König <christian.koenig@amd.com>
>> Cc: wangtao <tao.wangtao@honor.com>; Christoph Hellwig
>> <hch@infradead.org>; sumit.semwal@linaro.org; kraxel@redhat.com;
>> vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org;
>> hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com;
>> benjamin.gaignard@collabora.com; Brian.Starkey@arm.com;
>> jstultz@google.com; tjmercier@google.com; jack@suse.cz;
>> baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-
>> devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-
>> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
>> mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>;
>> yipengxiang <yipengxiang@honor.com>; liulu 00013167
>> <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
>> Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via
>> copy_file_range
>>
>> On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
>>>> dmabuf acts as a driver and shouldn't be handled by VFS, so I made
>>>> dmabuf implement copy_file_range callbacks to support direct I/O
>>>> zero-copy. I'm open to both approaches. What's the preference of VFS
>>>> experts?
>>>
>>> That would probably be illegal. Using the sg_table in the DMA-buf
>>> implementation turned out to be a mistake.
>>
>> Two things here that should not be directly conflated.  Using the sg_table was
>> a huge mistake, and we should try to switch dmabuf to a pure
> I'm a bit confused: don't dmabuf importers need to traverse the sg_table to
> access folios or dma_addr/len? Do you mean restricting sg_table access
> (e.g., only via iov_iter) or proposing alternative approaches?

No, accessing the pages/folios inside the sg_table of a DMA-buf is strictly forbidden.

We have removed most use cases of that over the years and push back on generating new ones. 

> 
>> dma_addr_t/len array now that the new DMA API supporting that has been
>> merged.  Is there any chance the dma-buf maintainers could start to kick this
>> off?  I'm of course happy to assist.

Work on that has already been underway for some time.

Most GPU drivers already do sg_table -> DMA array conversion, I need to push on the remaining to clean up.

But there are also tons of other users of dma_buf_map_attachment() which need to be converted.
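
The pattern that needs converting is roughly the classic importer flow
below (a sketch with error paths trimmed; these are existing dma-buf
APIs, using the _unlocked variants for importers that do not hold the
reservation lock):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

static int importer_map_example(struct dma_buf *dmabuf, struct device *dev)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        attach = dma_buf_attach(dmabuf, dev);
        if (IS_ERR(attach))
                return PTR_ERR(attach);

        /* Today this hands back an sg_table describing the DMA mapping. */
        sgt = dma_buf_map_attachment_unlocked(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
                dma_buf_detach(dmabuf, attach);
                return PTR_ERR(sgt);
        }

        /* ... program the device using sg_dma_address()/sg_dma_len() ... */

        dma_buf_unmap_attachment_unlocked(attach, sgt, DMA_BIDIRECTIONAL);
        dma_buf_detach(dmabuf, attach);
        return 0;
}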

>> But that notwithstanding, dma-buf is THE buffer sharing mechanism in the
>> kernel, and we should promote it instead of reinventing it badly.
>> And there is a use case for having a fully DMA mapped buffer in the block
>> layer and I/O path, especially on systems with an IOMMU.
>> So having an iov_iter backed by a dma-buf would be extremely helpful.
>> That's mostly lib/iov_iter.c code, not VFS, though.
> Are you suggesting adding an ITER_DMABUF type to iov_iter, or
> implementing dmabuf-to-iov_bvec conversion within iov_iter?

That would be rather nice to have, yeah.

> 
>>
>>> The question Christoph raised was rather why is your CPU so slow that
>>> walking the page tables has a significant overhead compared to the
>>> actual I/O?
>>
>> Yes, that's really puzzling and should be addressed first.
> With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
> is relatively low (observed in 3GHz tests).

Even on a low end CPU walking the page tables and grabbing references shouldn't be that much of an overhead.

There must be some reason why you see so much CPU overhead. E.g. compound pages are broken up or similar which should not happen in the first place.

Regards,
Christian.


> |    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1)        memfd direct R/W|      1 |    118 |    312 |   3448 | 100%
> | 2)      u+memfd direct R/W|    196 |    123 |    295 |   3651 | 105%
> | 3) u+memfd direct sendfile|    175 |    102 |    976 |   1100 |  31%
> | 4)   u+memfd direct splice|    173 |    103 |    443 |   2428 |  70%
> | 5)      udmabuf buffer R/W|    183 |    100 |    453 |   2375 |  68%
> | 6)       dmabuf buffer R/W|     34 |      4 |    427 |   2519 |  73%
> | 7)    udmabuf direct c_f_r|    200 |    102 |    278 |   3874 | 112%
> | 8)     dmabuf direct c_f_r|     36 |      5 |    269 |   4002 | 116%
> 
> With lower CPU performance (e.g., 1GHz), GUP overhead becomes more
> significant (as seen in 1GHz tests).
> |    32x32MB Read 1024MB    |Creat-ms|Close-ms|  I/O-ms|I/O-MB/s| I/O%
> |---------------------------|--------|--------|--------|--------|-----
> | 1)        memfd direct R/W|      2 |    393 |    969 |   1109 | 100%
> | 2)      u+memfd direct R/W|    592 |    424 |    570 |   1884 | 169%
> | 3) u+memfd direct sendfile|    587 |    356 |   2229 |    481 |  43%
> | 4)   u+memfd direct splice|    568 |    352 |    795 |   1350 | 121%
> | 5)      udmabuf buffer R/W|    597 |    343 |   1238 |    867 |  78%
> | 6)       dmabuf buffer R/W|     69 |     13 |   1128 |    952 |  85%
> | 7)    udmabuf direct c_f_r|    595 |    345 |    372 |   2889 | 260%
> | 8)     dmabuf direct c_f_r|     80 |     13 |    274 |   3929 | 354%
> 
> Regards,
> Wangtao.

Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months, 1 week ago
On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> >> dma_addr_t/len array now that the new DMA API supporting that has been
> >> merged.  Is there any chance the dma-buf maintainers could start to kick this
> >> off?  I'm of course happy to assist.
> 
> Work on that has already been underway for some time.
> 
> Most GPU drivers already do sg_table -> DMA array conversion, I need
> to push on the remaining to clean up.

Do you have a pointer?

> >> Yes, that's really puzzling and should be addressed first.
> > With high CPU performance (e.g., 3GHz), GUP (get_user_pages) overhead
> > is relatively low (observed in 3GHz tests).
> 
> Even on a low end CPU walking the page tables and grabbing references
> shouldn't be that much of an overhead.

Yes.

> 
> There must be some reason why you see so much CPU overhead. E.g.
> compound pages are broken up or similar which should not happen in
> the first place.

pin_user_pages unfortunately outputs an array of PAGE_SIZE (modulo
offset and a shorter last length) struct pages.  The block direct I/O
code has fairly recently grown code to reassemble folios from them,
which did speed up some workloads.
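
For reference, the extraction step in question is roughly the helper
below (the wrapper is illustrative only; iov_iter_extract_pages() is
the existing interface the block/iomap direct I/O code uses):

#include <linux/mm.h>
#include <linux/uio.h>

static ssize_t extract_iter_pages(struct iov_iter *iter, struct page ***pages,
                                  size_t maxsize, size_t *offset)
{
        /*
         * For a user-backed iterator (ITER_UBUF/ITER_IOVEC) this ends up
         * in pin_user_pages_fast() and walks the page tables; for an
         * ITER_BVEC iterator it just hands back the pages the bvec
         * already describes.
         */
        return iov_iter_extract_pages(iter, pages, maxsize, UINT_MAX, 0,
                                      offset);
}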

Is this test using the block device or iomap direct I/O code?  What
kernel version is it run on?
RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by wangtao 6 months, 1 week ago

> On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> > >> dma_addr_t/len array now that the new DMA API supporting that has
> > >> been merged.  Is there any chance the dma-buf maintainers could
> > >> start to kick this off?  I'm of course happy to assist.
> >
> > Work on that has already been underway for some time.
> >
> > Most GPU drivers already do sg_table -> DMA array conversion, I need
> > to push on the remaining to clean up.
> 
> Do you have a pointer?
> 
> > >> Yes, that's really puzzling and should be addressed first.
> > > With high CPU performance (e.g., 3GHz), GUP (get_user_pages)
> > > overhead is relatively low (observed in 3GHz tests).
> >
> > Even on a low end CPU walking the page tables and grabbing references
> > shouldn't be that much of an overhead.
> 
> Yes.
> 
> >
> > There must be some reason why you see so much CPU overhead. E.g.
> > compound pages are broken up or similar which should not happen in the
> > first place.
> 
> pin_user_pages unfortunately outputs an array of PAGE_SIZE (modulo offset
> and a shorter last length) struct pages.  The block direct I/O code has
> fairly recently grown code to reassemble folios from them, which did speed
> up some workloads.
> 
> Is this test using the block device or iomap direct I/O code?  What kernel
> version is it run on?
Here's my analysis on Linux 6.6 with F2FS using the iomap direct I/O path.

Comparing udmabuf+memfd direct read vs. dmabuf direct copy_file_range:
Systrace: on a high-end 3 GHz CPU, the former occupies >80% of the runtime
vs. <20% for the latter. On a low-end 1 GHz CPU, the former becomes
CPU-bound.
Perf: for the former, bio_iov_iter_get_pages/get_user_pages dominate the
latency. The latter avoids this via lightweight bvec assignments.
|- 13.03% __arm64_sys_read
|-|- 13.03% f2fs_file_read_iter
|-|-|- 13.03% __iomap_dio_rw
|-|-|-|- 12.95% iomap_dio_bio_iter
|-|-|-|-|- 10.69% bio_iov_iter_get_pages
|-|-|-|-|-|- 10.53% iov_iter_extract_pages
|-|-|-|-|-|-|- 10.53% pin_user_pages_fast
|-|-|-|-|-|-|-|- 10.53% internal_get_user_pages_fast
|-|-|-|-|-|-|-|-|- 10.23% __gup_longterm_locked
|-|-|-|-|-|-|-|-|-|- 8.85% __get_user_pages
|-|-|-|-|-|-|-|-|-|-|- 6.26% handle_mm_fault
|-|-|-|-|- 1.91% iomap_dio_submit_bio
|-|-|-|-|-|- 1.64% submit_bio

|- 1.13% __arm64_sys_copy_file_range
|-|- 1.13% vfs_copy_file_range
|-|-|- 1.13% dma_buf_copy_file_range
|-|-|-|- 1.13% system_heap_dma_buf_rw_file
|-|-|-|-|- 1.13% f2fs_file_read_iter
|-|-|-|-|-|- 1.13% __iomap_dio_rw
|-|-|-|-|-|-|- 1.13% iomap_dio_bio_iter
|-|-|-|-|-|-|-|- 1.13% iomap_dio_submit_bio
|-|-|-|-|-|-|-|-|- 1.08% submit_bio

Large folios can reduce the GUP overhead, but it is still significantly
slower than the direct dmabuf-to-bio_vec conversion.

Regards,
Wangtao.
Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
Posted by Christoph Hellwig 6 months ago
On Fri, Jun 13, 2025 at 09:43:08AM +0000, wangtao wrote:
> Here's my analysis on Linux 6.6 with F2FS/iomap.

Linux 6.6 is almost two years old and completely irrelevant.  Please
provide numbers on 6.16 or current Linus' tree.