RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Zeng, Oak posted 16 patches 1 year, 7 months ago
There is a newer version of this series
RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zeng, Oak 1 year, 7 months ago
Hi Leon, Jason

> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, March 5, 2024 6:19 AM
> Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
> 
> This is the complementary part to the proposed LSF/MM topic:
> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
> 
> This is posted as an RFC to get feedback on the proposed split. The RDMA,
> VFIO and DMA patches are ready for review and inclusion; the NVMe patches
> are still in progress as they require agreement on the API first.
> 
> Thanks
> 
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at the same time: it allocates
> IOVA space and actually maps DMA pages into that space. This one-shot
> operation works perfectly for simple scenarios, where callers use the DMA
> API in the control path when they set up hardware.
> 
> However, in more complex scenarios, where DMA mapping is needed in the data
> path, and especially when some sort of specific datatype is involved, such a
> one-shot approach has its drawbacks.
> 
> That approach pushes developers to introduce new DMA APIs for each specific
> datatype: for example, the existing scatter-gather mapping functions, Chuck's
> latest RFC series to add biovec-related DMA mapping [1], and probably struct
> folio will need one too.
> 
> These advanced DMA mapping APIs are needed to calculate the IOVA size, so
> it can be allocated as one chunk, and to perform offset calculations to
> know which part of the IOVA to map.
> 
> Instead of teaching the DMA core about these specific datatypes, let's
> separate the existing DMA mapping routine into two steps and give advanced
> callers (subsystems) the option to perform all calculations internally in
> advance and map pages later, when needed.

I looked into how this scheme can be applied to the DRM subsystem and GPU drivers.
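
Before going into the use cases, here is my reading of the proposed two-step flow, as rough pseudo-code. The function names are the ones used in this series and thread (dma_alloc_iova()/dma_link_range()), but the argument lists below are my guesses, so please treat this as a sketch only:

	/*
	 * Sketch only, not the series code: reserve the IOVA once,
	 * then link pages into it whenever they become available.
	 */
	struct dma_iova_attrs iova;	/* hypothetical IOVA handle type */
	dma_addr_t dma[NPAGES];
	int i, err;

	err = dma_alloc_iova(&iova, NPAGES * PAGE_SIZE);	/* step 1 */
	if (err)
		return err;

	for (i = 0; i < NPAGES; i++)				/* step 2 */
		dma[i] = dma_link_range(&iova, page_to_phys(pages[i]),
					i * PAGE_SIZE);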

I figured RDMA can apply this scheme because RDMA can calculate the IOVA size. Per my limited knowledge of RDMA, a user can register a memory region (the reg_user_mr vfunc), and the memory region's size is used to pre-allocate the IOVA space. And in the RDMA use case, it seems the user-registered region can be very big, e.g., 512MiB or even GiB.

In the GPU driver, we have a few use cases where we need dma-mapping. To name two:

1) userptr: this is user malloc'ed/mmap'ed memory that is registered to the GPU (in Intel's driver it is through a vm_bind API, similar to mmap). A userptr can be of any random size, depending on the user's malloc size. Today we use dma-map-sg for this use case. The downside of our approach is that during userptr invalidation, even if the user munmaps only part of a userptr, we invalidate the whole userptr from the GPU page table, because there is no way for us to partially dma-unmap the whole sg list. I think we can try your new API in this case; the main benefit of the new approach is the partial munmap case, as sketched below.
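
Assuming the series keeps an unlink counterpart to dma_link_range() (patch 3 talks about link/unlink callbacks; dma_unlink_range() below is purely my guess at a name and signature), the partial munmap path could then look roughly like:

	/*
	 * Hypothetical sketch: unlink only the invalidated sub-range
	 * instead of throwing away the whole sg list.
	 */
	unsigned long i, first, last;

	first = (munmap_start - userptr_start) >> PAGE_SHIFT;
	last = (munmap_end - userptr_start) >> PAGE_SHIFT;
	for (i = first; i < last; i++)
		dma_unlink_range(&iova, i * PAGE_SIZE);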

We will have to pre-allocate an IOVA for each userptr, and we have many userptrs of random size... so we might not be as efficient as the RDMA case, where I assume the user registers a few big memory regions.

2) system allocator: this is malloc'ed/mmap'ed memory used by a GPU program directly, without any other extra driver API call. We call this use case the system allocator.

For the system allocator, the driver has no knowledge of which virtual address range is valid in advance. So when the GPU accesses a malloc'ed/mmap'ed address, we get a page fault. We then look up the CPU VMA which contains the fault address. I guess we can use the CPU VMA size to allocate an IOVA space of the same size?

But there will be a real difficulty in applying your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for the driver to mark "this page/pfn has been populated, no need to re-populate again", roughly. Unlike the userptr and RDMA use cases, where the backing store of a buffer is always in system memory, in the system allocator use case the backing store can change between system memory and the GPU's device-private memory. Even worse, we have to assume the data migration between system and GPU is dynamic. When data is migrated to the GPU, we don't need to dma-map. And when a migration happens to a pfn with the STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way.


Oak


> 
> In this series, three users are converted, and each conversion presents
> a different positive gain:
> 1. RDMA simplifies and speeds up its pagefault handling for
>    on-demand-paging (ODP) mode.
> 2. VFIO PCI live migration code saves a huge chunk of memory.
> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>    directly on BIOs.
> 
> Thanks
> 
> [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
> 
> Chaitanya Kulkarni (2):
>   block: add dma_link_range() based API
>   nvme-pci: use blk_rq_dma_map() for NVMe SGL
> 
> Leon Romanovsky (14):
>   mm/hmm: let users to tag specific PFNs
>   dma-mapping: provide an interface to allocate IOVA
>   dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>   iommu/dma: Provide an interface to allow preallocate IOVA
>   iommu/dma: Prepare map/unmap page functions to receive IOVA
>   iommu/dma: Implement link/unlink page callbacks
>   RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>   RDMA/umem: Store ODP access mask information in PFN
>   RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>   RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>   vfio/mlx5: Explicitly use number of pages instead of allocated length
>   vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>   vfio/mlx5: Explicitly store page list
>   vfio/mlx5: Convert vfio to use DMA link API
> 
>  Documentation/core-api/dma-attributes.rst |   7 +
>  block/blk-merge.c                         | 156 ++++++++++++++
>  drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>  drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>  drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>  drivers/nvme/host/pci.c                   | 220 +++++--------------
>  drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>  drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>  drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>  include/linux/blk-mq.h                    |   9 +
>  include/linux/dma-map-ops.h               |  13 ++
>  include/linux/dma-mapping.h               |  39 ++++
>  include/linux/hmm.h                       |   3 +
>  include/rdma/ib_umem_odp.h                |  22 +-
>  include/rdma/ib_verbs.h                   |  54 +++++
>  kernel/dma/debug.h                        |   2 +
>  kernel/dma/direct.h                       |   7 +-
>  kernel/dma/mapping.c                      |  91 ++++++++
>  mm/hmm.c                                  |  34 +--
>  20 files changed, 870 insertions(+), 605 deletions(-)
> 
> --
> 2.44.0
Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Jason Gunthorpe 1 year, 7 months ago
On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:

> > Instead of teaching the DMA core about these specific datatypes, let's
> > separate the existing DMA mapping routine into two steps and give
> > advanced callers (subsystems) the option to perform all calculations
> > internally in advance and map pages later, when needed.
> 
> I looked into how this scheme can be applied to the DRM subsystem and GPU drivers.
> 
> I figured RDMA can apply this scheme because RDMA can calculate the
> IOVA size. Per my limited knowledge of RDMA, a user can register a
> memory region (the reg_user_mr vfunc), and the memory region's size is
> used to pre-allocate the IOVA space. And in the RDMA use case, it seems
> the user-registered region can be very big, e.g., 512MiB or even GiB.

In RDMA the IOVA would be linked to the SVA granule we discussed
previously.

> In the GPU driver, we have a few use cases where we need dma-mapping. To name two:
> 
> 1) userptr: this is user malloc'ed/mmap'ed memory that is registered to
> the GPU (in Intel's driver it is through a vm_bind API, similar to
> mmap). A userptr can be of any random size, depending on the user's
> malloc size. Today we use dma-map-sg for this use case. The downside of
> our approach is that during userptr invalidation, even if the user
> munmaps only part of a userptr, we invalidate the whole userptr from
> the GPU page table, because there is no way for us to partially
> dma-unmap the whole sg list. I think we can try your new API in this
> case; the main benefit of the new approach is the partial munmap case.

Yes, this is one of the main things it will improve.
 
> We will have to pre-allocate an IOVA for each userptr, and we have many
> userptrs of random size... so we might not be as efficient as the RDMA
> case, where I assume the user registers a few big memory regions.

You are already doing this. dma_map_sg() does exactly the same IOVA
allocation under the covers.
 
> 2) system allocator: this is malloc'ed/mmap'ed memory used by a GPU
> program directly, without any other extra driver API call. We call
> this use case the system allocator.
 
> For the system allocator, the driver has no knowledge of which virtual
> address range is valid in advance. So when the GPU accesses a
> malloc'ed/mmap'ed address, we get a page fault. We then look up the
> CPU VMA which contains the fault address. I guess we can use the CPU
> VMA size to allocate an IOVA space of the same size?

No. You'd follow what we discussed in the other thread.

If you do a full SVA then you'd split your MM space into granules, and
when a fault hits a granule you'd allocate the IOVA for the whole
granule. RDMA ODP is using a 512M granule currently.

If you are doing sub-ranges then you'd probably allocate the IOVA for
the well-defined sub-range (assuming the typical use case isn't huge).
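
To be concrete, the per-fault work is small. An untested sketch, where
mm_granules is an xarray and granule_alloc_iova() is a made-up helper
that would call dma_alloc_iova() for a whole granule:

	/* Sketch: 512M granules, one IOVA allocation per granule. */
	#define GRANULE_SHIFT	29	/* 512M */

	unsigned long granule_idx = fault_addr >> GRANULE_SHIFT;
	struct granule *g = xa_load(&mm_granules, granule_idx);

	if (!g)		/* first fault in this granule */
		g = granule_alloc_iova(&mm_granules, granule_idx, SZ_512M);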

> But there will be a real difficulty in applying your scheme to this
> use case. It is related to the STICKY flag. As I understand it, the
> sticky flag is designed for the driver to mark "this page/pfn has been
> populated, no need to re-populate again", roughly. Unlike the userptr
> and RDMA use cases, where the backing store of a buffer is always in
> system memory, in the system allocator use case the backing store can
> change between system memory and the GPU's device-private memory. Even
> worse, we have to assume the data migration between system and GPU is
> dynamic. When data is migrated to the GPU, we don't need to dma-map.
> And when a migration happens to a pfn with the STICKY flag, we still
> need to repopulate this pfn. So you can see, it is not easy to apply
> this scheme to this use case. At least I can't see an obvious way.

You are already doing this today: you are keeping the sg list around
until you unmap it.

Instead of keeping the sg list you'd keep a much smaller data structure
per-granule. The sticky bit is simply a convenient way for ODP to manage
the smaller data structure; you don't have to use it.

But you do need to keep track of which pages in the granule have been
DMA mapped - the sg list was doing this before. This could be a simple
bitmap array matching the granule size, as in the sketch below.
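
An untested sketch of that per-granule state (all names made up, and in
practice you'd likely allocate the bitmap separately):

	/* One bit per page replaces the sg list's bookkeeping. */
	struct granule {
		struct dma_iova_attrs iova;	/* hypothetical IOVA handle */
		unsigned long mapped[BITS_TO_LONGS(SZ_512M / PAGE_SIZE)];
	};

	/* DMA map a page only the first time a fault touches it. */
	if (!test_and_set_bit(page_idx, g->mapped))
		dma_link_range(&g->iova, phys, page_idx * PAGE_SIZE);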

Looking (far) forward we may be able to have a "replace" API that
allows installing a new page unconditionally regardless of what is
already there.

Jason
RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zeng, Oak 1 year, 6 months ago
Hi Jason, Leon,

I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) into the DRM subsystem. The new API seems to fit our purpose pretty well, better than scatter-gather dma-mapping, so we want to continue working with you to adopt this new API.

Did you test the new API in the RDMA subsystem? Or was this RFC series just some untested code sent out to get people's design feedback? Do you have a refined version for us to try? I ask because we are seeing some issues but are not sure whether they are caused by the new API. We are debugging, but it would be good to also ask at the same time.

Cc'ing Himal/Krishna, who are also working on/testing the new API.

Thanks,
Oak

Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Leon Romanovsky 1 year, 6 months ago
On Mon, Jun 10, 2024 at 03:12:25PM +0000, Zeng, Oak wrote:
> Hi Jason, Leon,
> 
> I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) into the DRM subsystem. The new API seems to fit our purpose pretty well, better than scatter-gather dma-mapping, so we want to continue working with you to adopt this new API.

Sounds great, thanks for the feedback.

> 
> Did you test the new API in the RDMA subsystem?

This version was tested in our regression tests, but there is a chance
that you are hitting flows that were not relevant for the RDMA case.

> Or was this RFC series just some untested code sent out to get people's design feedback?

The RFC was fully tested in the VFIO and RDMA paths, but not the NVMe patch.

> Do you have a refined version for us to try? I ask because we are seeing some issues but are not sure whether they are caused by the new API. We are debugging, but it would be good to also ask at the same time.

Yes, as an outcome of the feedback in this thread, I implemented a new
version. Unfortunately, some personal matters are preventing me from
sending it right away:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1

There are some differences in the API, but the main idea is the same.
This version is not fully tested yet.

Thanks

RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zeng, Oak 1 year, 6 months ago
Thanks Leon and Yanjun for the reply!

Based on the replies, we will continue using the current version for testing (as it is tested for VFIO and RDMA). We will switch to v1 once it is fully tested/reviewed.

Thanks,
Oak

Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Leon Romanovsky 1 year, 6 months ago
On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> Thanks Leon and Yanjun for the reply!
> 
> Based on the replies, we will continue using the current version for testing (as it is tested for VFIO and RDMA). We will switch to v1 once it is fully tested/reviewed.

Sounds good. If v0 fits your need, v1 will fit it too.

From the HMM perspective, the change between them is minimal.
In v0 I called dma_link_page() here, and now it is called
dma_hmm_link_page():

https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/diff/drivers/infiniband/hw/mlx5/odp.c?h=dma-split-v1&id=a0d719a406133cdc3ef2328dda3ef082a034c45e


Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Jason Gunthorpe 1 year, 6 months ago
On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> Thanks Leon and Yanjun for the reply!
> 
> Based on the replies, we will continue using the current version for
> testing (as it is tested for VFIO and RDMA). We will switch to v1 once
> it is fully tested/reviewed.

I'm glad you are finding it useful; one of my interests with this work
is to improve all the HMM users.

Jason
RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zeng, Oak 1 year, 6 months ago
Hi Jason, Leon,

I was able to fix the issue on my side. Things work fine now. I have two questions, though:

1) The values returned from the dma_link_range() function are not contiguous; see the print below ("linked pa" is the function's return value).
I think the dma_map_sgtable() API would return contiguous DMA addresses. Is the dma_map_sgtable() API more efficient regarding the IOMMU page table, i.e., does it try to use a bigger page size, such as 2M, when possible? With your new API, is there also such a consideration? I vaguely remember Jason mentioning such a thing, but my print below doesn't look like it. Maybe I need to test a bigger range (only a 16-page range in the test below). Comments?

[17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
[17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
[17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
[17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000

2) In the comment of the dma_link_range() function, it is said: "@dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page".
Is this description correct? I don't understand the part "+ DMA address returned for the previous page".
In my code, let's say I call this function to link 10 pages: the first dma_offset is 0, the second is 4K, the third 8K. This worked for me; I didn't add the previously returned DMA address.
Maybe I need more testing. But any comment?

Thanks,
Oak

Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Leon Romanovsky 1 year, 6 months ago
On Mon, Jun 10, 2024 at 09:28:04PM +0000, Zeng, Oak wrote:
> Hi Jason, Leon,
> 
> I was able to fix the issue on my side. Things work fine now. I have two questions, though:
> 
> 1) The values returned from the dma_link_range() function are not contiguous; see the print below ("linked pa" is the function's return value).
> I think the dma_map_sgtable() API would return contiguous DMA addresses. Is the dma_map_sgtable() API more efficient regarding the IOMMU page table, i.e., does it try to use a bigger page size, such as 2M, when possible? With your new API, is there also such a consideration? I vaguely remember Jason mentioning such a thing, but my print below doesn't look like it. Maybe I need to test a bigger range (only a 16-page range in the test below). Comments?

My API gives you the flexibility to use any page size you want. You can
use 2M pages instead of 4K pages; the API doesn't enforce any page size.
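
For example, nothing stops you from advancing dma_offset in 2M strides
when you link 2M pages; a sketch only, with the argument list guessed
rather than taken from the series:

	/* Same pattern as with 4K pages, just a 2M stride per link. */
	for (i = 0; i < n_huge_pages; i++)
		dma_link_range(&iova, page_to_phys(huge_pages[i]),
			       i * SZ_2M);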

> 
> [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
> [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
> [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
> [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000
> 
> 2) In the comment of the dma_link_range() function, it is said: "@dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page".
> Is this description correct? I don't understand the part "+ DMA address returned for the previous page".
> In my code, let's say I call this function to link 10 pages: the first dma_offset is 0, the second is 4K, the third 8K. This worked for me; I didn't add the previously returned DMA address.
> Maybe I need more testing. But any comment?

You did it perfectly right. This is the correct way to advance dma_offset.

Thanks

RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zeng, Oak 1 year, 6 months ago
Thank you Leon. That is helpful.

I also have another very naïve question: I don't understand what the iova address is. I previously thought the iova address space is the same as the dma_address space when an iommu is involved: dma_alloc_iova would allocate a contiguous iova address range, and dma_link_range would later link a physical page into that range and return the iova address. In other words, I thought the dma_address is the iova address, and the iommu page table translates a dma_address (i.e., an iova address) to a physical address.

But from my prints below, that understanding is obviously wrong: iova.dma_addr is 0 while the dma_address returned from dma_link_range is non-zero... Can you help me understand what the iova address is? Is the iova address iommu related? Since dma_link_range returns a non-iova address, does this function allocate the dma-address itself? Is the dma-address correlated with the iova address?

Oak 

> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, June 11, 2024 11:45 AM
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
> 
> On Mon, Jun 10, 2024 at 09:28:04PM +0000, Zeng, Oak wrote:
> > Hi Jason, Leon,
> >
> > I was able to fix the issue from my side. Things work fine now. I got two
> questions though:
> >
> > 1) The value returned from the dma_link_range function is not
> > contiguous; see the prints below. The "linked pa" is the function's
> > return value.
> > I think the dma_map_sgtable API would return a contiguous dma address.
> > Is the dma_map_sgtable API more efficient with respect to the IOMMU
> > page table, i.e., does it try to use a bigger page size, such as 2M,
> > when possible? Does your new API have the same consideration? I vaguely
> > remember Jason mentioning such a thing, but my prints below don't look
> > like it. Maybe I need to test a bigger range (the test below only
> > covers a 16-page range). Comments?
> 
> My API gives you the flexibility to use any page size you want. You can
> use 2M pages instead of 4K pages. The API doesn't enforce any page size.
> 
> >
> > [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
> > [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
> > [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
> > [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000
> >
> > 2) The comment of the dma_link_range function says: "@dma_offset needs
> > to be advanced by the caller with the size of previous page that was
> > linked + DMA address returned for the previous page".
> > Is this description correct? I don't understand the part "+ DMA address
> > returned for the previous page".
> > In my code, say I call this function to link 10 pages: the first
> > dma_offset is 0, the second 4k, the third 8k. This worked for me; I did
> > not add the previously returned dma address.
> > Maybe I need more testing, but any comment?
> 
> You did it perfectly right. This is the correct way to advance dma_offset.
> 
> Thanks
> 
> >
> > Thanks,
> > Oak
> >
> > > -----Original Message-----
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Monday, June 10, 2024 1:25 PM
> > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > > two steps
> > >
> > > On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> > > > Thanks Leon and Yanjun for the reply!
> > > >
> > > > Based on the reply, we will continue to use the current version for
> > > > testing (as it is tested for vfio and rdma). We will switch to v1 once
> > > > it is fully tested/reviewed.
> > >
> > > I'm glad you are finding it useful, one of my interests with this work
> > > is to improve all the HMM users.
> > >
> > > Jason
Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Leon Romanovsky 1 year, 6 months ago
On Tue, Jun 11, 2024 at 06:26:23PM +0000, Zeng, Oak wrote:
> Thank you Leon. That is helpful.
> 
> I also have another very naïve question: I don't understand what the iova address is. I previously thought the iova address space is the same as the dma_address space when an iommu is involved: dma_alloc_iova would allocate a contiguous iova address range, and dma_link_range would later link a physical page into that range and return the iova address. In other words, I thought the dma_address is the iova address, and the iommu page table translates a dma_address (i.e., an iova address) to a physical address.

This is the right understanding.

> 
> But from my prints below, that understanding is obviously wrong: iova.dma_addr is 0 while the dma_address returned from dma_link_range is non-zero... Can you help me understand what the iova address is? Is the iova address iommu related? Since dma_link_range returns a non-iova address, does this function allocate the dma-address itself? Is the dma-address correlated with the iova address?

This is a combination of two things:
1. The need to support HMM-specific logic.
2. The outcome of the v0 version, where I implemented dma_link_range() to fall back to DMA direct mode. See patches 2 and 3:
https://lore.kernel.org/all/54a3554639bfb963c9919c5d7c1f449021bebdb3.1709635535.git.leon@kernel.org/
https://lore.kernel.org/all/f1049f0fc280288ae2f0c1e02388cde91b0f7876.1709635535.git.leon@kernel.org/

So dma-iova == 0 means that you are working in direct mode, without the IOMMU; e.g., you can translate from a physical address
to a DMA address with a simple call to phys_to_dma().
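
In code terms, the v0 fallback amounts to roughly the following (a sketch
of the behaviour described above, not the actual patch):

	/* iova->addr == 0: direct mode, no IOMMU translation involved */
	if (!iova->addr)
		return phys_to_dma(dev, page_to_phys(page)) + offset;

	/* otherwise the returned DMA address lives inside the IOVA range */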

One of the review comments was that this is not desired behaviour, and
that I need to create separate functions that are used only when an IOMMU is present.

See the difference between v0 and v1 of the dma_link_range() function:
v0: https://lore.kernel.org/all/f1049f0fc280288ae2f0c1e02388cde91b0f7876.1709635535.git.leon@kernel.org/
v1: https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=dma-split-v1&id=5aa29f2620ef86ac58c17a0297929a0b9e8d7790

And here is the HMM variant of the dma_link_range() function, which saves
you from copy/pasting the same HMM logic from RDMA to DRM:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=dma-split-v1&id=4d8d8d4fbe7891b1412f03ecaff88bc492e2e4eb

Thanks

> 
> Oak 
> 
> > -----Original Message-----
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Tuesday, June 11, 2024 11:45 AM
> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > two steps
> > 
> > On Mon, Jun 10, 2024 at 09:28:04PM +0000, Zeng, Oak wrote:
> > > Hi Jason, Leon,
> > >
> > > I was able to fix the issue from my side. Things work fine now. I got two
> > questions though:
> > >
> > > 1) The value returned from the dma_link_range function is not
> > > contiguous; see the prints below. The "linked pa" is the function's
> > > return value.
> > > I think the dma_map_sgtable API would return a contiguous dma address.
> > > Is the dma_map_sgtable API more efficient with respect to the IOMMU
> > > page table, i.e., does it try to use a bigger page size, such as 2M,
> > > when possible? Does your new API have the same consideration? I
> > > vaguely remember Jason mentioning such a thing, but my prints below
> > > don't look like it. Maybe I need to test a bigger range (the test
> > > below only covers a 16-page range). Comments?
> > 
> > My API gives you the flexibility to use any page size you want. You can
> > use 2M pages instead of 4K pages. The API doesn't enforce any page size.
> > 
> > >
> > > [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
> > > [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
> > > [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
> > > [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000
> > >
> > > 2) The comment of the dma_link_range function says: "@dma_offset
> > > needs to be advanced by the caller with the size of previous page
> > > that was linked + DMA address returned for the previous page".
> > > Is this description correct? I don't understand the part "+ DMA
> > > address returned for the previous page".
> > > In my code, say I call this function to link 10 pages: the first
> > > dma_offset is 0, the second 4k, the third 8k. This worked for me; I
> > > did not add the previously returned dma address.
> > > Maybe I need more testing, but any comment?
> > 
> > You did it perfectly right. This is the correct way to advance dma_offset.
> > 
> > Thanks
> > 
> > >
> > > Thanks,
> > > Oak
> > >
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > > Sent: Monday, June 10, 2024 1:25 PM
> > > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > > > two steps
> > > >
> > > > On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> > > > > Thanks Leon and Yanjun for the reply!
> > > > >
> > > > > Based on the reply, we will continue to use the current version for
> > > > > testing (as it is tested for vfio and rdma). We will switch to v1 once
> > > > > it is fully tested/reviewed.
> > > >
> > > > I'm glad you are finding it useful, one of my interests with this work
> > > > is to improve all the HMM users.
> > > >
> > > > Jason
> 
Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zhu Yanjun 1 year, 6 months ago
On 10.06.24 23:28, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I was able to fix the issue from my side. Things work fine now.

Can you enlarge the dma list and then run tests with fio? I am not sure
whether the performance is better or not.
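
For example, something along these lines (an illustrative invocation
only; adjust the device and parameters to your setup):

	fio --name=randread --filename=/dev/nvme0n1 --direct=1 \
	    --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
	    --time_based --runtime=60 --group_reporting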

Thanks,

Zhu Yanjun

> I got two questions though:
>
> 1) The value returned from the dma_link_range function is not contiguous; see the prints below. The "linked pa" is the function's return value.
> I think the dma_map_sgtable API would return a contiguous dma address. Is the dma_map_sgtable API more efficient with respect to the IOMMU page table, i.e., does it try to use a bigger page size, such as 2M, when possible? Does your new API have the same consideration? I vaguely remember Jason mentioning such a thing, but my prints below don't look like it. Maybe I need to test a bigger range (the test below only covers a 16-page range). Comments?
>
> [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
> [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
> [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
> [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000
>
> 2) The comment of the dma_link_range function says: "@dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page".
> Is this description correct? I don't understand the part "+ DMA address returned for the previous page".
> In my code, say I call this function to link 10 pages: the first dma_offset is 0, the second 4k, the third 8k. This worked for me; I did not add the previously returned dma address.
> Maybe I need more testing, but any comment?
>
> Thanks,
> Oak
>
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@ziepe.ca>
>> Sent: Monday, June 10, 2024 1:25 PM
>> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
>> two steps
>>
>> On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
>>> Thanks Leon and Yanjun for the reply!
>>>
>>> Based on the reply, we will continue to use the current version for
>>> testing (as it is tested for vfio and rdma). We will switch to v1 once
>>> it is fully tested/reviewed.
>> I'm glad you are finding it useful, one of my interests with this work
>> is to improve all the HMM users.
>>
>> Jason

-- 
Best

Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zhu Yanjun 1 year, 6 months ago
On 10.06.24 17:12, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) into the DRM subsystem. The new API seems to fit our purpose quite well, better than scatter-gather dma-mapping, so we want to continue working with you to adopt it.
>
> Did you test the new API in the RDMA subsystem? Or was this RFC series just untested code sent out to gather design feedback? Do you have a refined version for us to try? I ask because we are seeing some issues but are not sure whether they are caused by the new API. We are debugging, but it seemed worthwhile to ask at the same time.

Hi, Zeng

I have tested this patch series. The NVMe patch causes some call traces,
but if you revert it, the rest of the patches work well. You can develop
your patches based on this series.

It seems that agreement could not be reached on the NVMe API, so the
NVMe patch does not work well yet. I did not delve into it.

Zhu Yanjun

>
> Cc Himal/Krishna who are also working/testing the new API.
>
> Thanks,
> Oak
>
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@ziepe.ca>
>> Sent: Friday, May 3, 2024 12:43 PM
>> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
>> two steps
>>
>> On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
>>
>>>> Instead of teaching DMA to know these specific datatypes, let's separate
>>>> existing DMA mapping routine to two steps and give an option to
>> advanced
>>>> callers (subsystems) perform all calculations internally in advance and
>>>> map pages later when it is needed.
>>> I looked into how this scheme can be applied to DRM subsystem and GPU
>> drivers.
>>> I figured RDMA can apply this scheme because RDMA can calculate the
>>> iova size. Per my limited knowledge of rdma, user can register a
>>> memory region (the reg_user_mr vfunc) and memory region's sized is
>>> used to pre-allocate iova space. And in the RDMA use case, it seems
>>> the user registered region can be very big, e.g., 512MiB or even GiB
>> In RDMA the iova would be linked to the SVA granule we discussed
>> previously.
>>
>>> In GPU driver, we have a few use cases where we need dma-mapping. Just
>>> name two:
>>> 1) userptr: it is user malloc'ed/mmap'ed memory registered to the gpu
>>> (in Intel's driver it is through a vm_bind api, similar to mmap). A
>>> userptr can be of any random size, depending on the user's malloc
>>> size. Today we use dma-map-sg for this use case. The downside of our
>>> approach is that during userptr invalidation, even if the user munmaps
>>> only part of a userptr, we invalidate the whole userptr from the gpu
>>> page table, because there is no way for us to partially dma-unmap the
>>> whole sg list. I think we can try your new API in this case. The main
>>> benefit of the new approach is the partial munmap case.
>> Yes, this is one of the main things it will improve.
>>
>>> We will have to pre-allocate an iova range for each userptr, and we
>>> have many userptrs of random size... So we might not be as efficient
>>> as the RDMA case, where I assume the user registers a few big memory
>>> regions.
>> You are already doing this. dma_map_sg() does exactly the same IOVA
>> allocation under the covers.
>>
>>> 2) system allocator: it is malloc'ed/mmap'ed memory used by the GPU
>>> program directly, without any other extra driver API call. We call
>>> this use case the system allocator.
>>> For the system allocator, the driver has no knowledge of which virtual
>>> address range is valid in advance. So when the GPU accesses a
>>> malloc'ed/mmap'ed address, we get a page fault. We then look up the
>>> CPU vma which contains the faulting address. I guess we can use the
>>> CPU vma size to allocate an iova space of the same size?
>> No. You'd follow what we discussed in the other thread.
>>
>> If you do a full SVA then you'd split your MM space into granules and
>> when a fault hits a granule you'd allocate the IOVA for the whole
>> granule. RDMA ODP is using a 512M granule currently.
>>
>> If you are doing sub ranges then you'd probably allocate the IOVA for
>> the well defined sub range (assuming the typical use case isn't huge)
>>
>>> But there will be a true difficulty to apply your scheme to this use
>>> case. It is related to the STICKY flag. As I understand it, the
>>> sticky flag is designed for driver to mark "this page/pfn has been
>>> populated, no need to re-populate again", roughly...Unlike userptr
>>> and RDMA use cases where the backing store of a buffer is always in
>>> system memory, in the system allocator use case, the backing store
>>> can be changing b/t system memory and GPU's device private
>>> memory. Even worse, we have to assume the data migration b/t system
>>> and GPU is dynamic. When data is migrated to GPU, we don't need
>>> dma-map. And when migration happens to a pfn with STICKY flag, we
>>> still need to repopulate this pfn. So you can see, it is not easy to
>>> apply this scheme to this use case. At least I can't see an obvious
>>> way.
>> You are already doing this today, you are keeping the sg list around
>> until you unmap it.
>>
>> Instead of keeping the sg list you'd keep a much smaller data structure
>> per-granule. The sticky bit is simply a convenient way for ODP to manage
>> the smaller data structure, you don't have to use it.
>>
>> But you do need to keep track of which pages in the granule have been
>> DMA mapped - the sg list was doing this before. This could be a simple
>> bitmap array matching the granule size.
>>
>> Looking (far) forward we may be able to have a "replace" API that
>> allows installing a new page unconditionally regardless of what is
>> already there.
>>
>> Jason

-- 
Best Regards,
Yanjun.Zhu

RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zeng, Oak 1 year, 7 months ago

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, May 3, 2024 12:43 PM
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
> 
> On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
> 
> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to
> advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> >
> > I looked into how this scheme can be applied to DRM subsystem and GPU
> drivers.
> >
> > I figured RDMA can apply this scheme because RDMA can calculate the
> > iova size. Per my limited knowledge of rdma, user can register a
> > memory region (the reg_user_mr vfunc) and memory region's sized is
> > used to pre-allocate iova space. And in the RDMA use case, it seems
> > the user registered region can be very big, e.g., 512MiB or even GiB
> 
> In RDMA the iova would be linked to the SVA granule we discussed
> previously.

I need to learn more about this scheme.

Let's say a 512MiB granule... On a machine with 57-bit virtual addresses, the user address space can be up to 56 bits (e.g., a half-half split between kernel and user).

So you would end up with 134,217,728 sub-regions (2^56 / 2^29 = 2^27), which is huge...

Does RDMA use a much smaller virtual address space?

With a 512MiB granule, do you fault-in or map a 512MiB virtual address range into the RDMA page table? E.g., when a page fault happens at address A, do you fault-in the whole 512MiB region? How do you make sure all addresses in the 512MiB region are valid virtual addresses?



> 
> > In GPU driver, we have a few use cases where we need dma-mapping. Just
> > name two:
> >
> > 1) userptr: it is user malloc'ed/mmap'ed memory registered to the gpu
> > (in Intel's driver it is through a vm_bind api, similar to mmap). A
> > userptr can be of any random size, depending on the user's malloc
> > size. Today we use dma-map-sg for this use case. The downside of our
> > approach is that during userptr invalidation, even if the user munmaps
> > only part of a userptr, we invalidate the whole userptr from the gpu
> > page table, because there is no way for us to partially dma-unmap the
> > whole sg list. I think we can try your new API in this case. The main
> > benefit of the new approach is the partial munmap case.
> 
> Yes, this is one of the main things it will improve.
> 
> > We will have to pre-allocate an iova range for each userptr, and we
> > have many userptrs of random size... So we might not be as efficient
> > as the RDMA case, where I assume the user registers a few big memory
> > regions.
> 
> You are already doing this. dma_map_sg() does exactly the same IOVA
> allocation under the covers.

Sure. Then we can replace our _sg usage with your new DMA API once it is merged. We will gain the benefit with a little more code.
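
(To make the expected gain concrete - a hypothetical sketch of partial
invalidation with the two-step API; dma_unlink_range() and its signature
follow the v0 RFC style and are assumptions:)

	/*
	 * Invalidate only pages [first, last] of a userptr instead of the
	 * whole range - something dma_map_sg()/dma_unmap_sg() cannot do.
	 */
	for (i = first; i <= last; i++) {
		dma_unlink_range(&iova, i * PAGE_SIZE);
		dma_addrs[i] = 0;	/* driver-side bookkeeping */
	}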

> 
> > 2) system allocator: it is malloc'ed/mmap'ed memory used by the GPU
> > program directly, without any other extra driver API call. We call
> > this use case the system allocator.
> 
> > For the system allocator, the driver has no knowledge of which virtual
> > address range is valid in advance. So when the GPU accesses a
> > malloc'ed/mmap'ed address, we get a page fault. We then look up the
> > CPU vma which contains the faulting address. I guess we can use the
> > CPU vma size to allocate an iova space of the same size?
> 
> No. You'd follow what we discussed in the other thread.
> 
> If you do a full SVA then you'd split your MM space into granules and
> when a fault hits a granule you'd allocate the IOVA for the whole
> granule. RDMA ODP is using a 512M granule currently.

Per the system allocator requirement, we have to do full SVA (which means ANY valid CPU virtual address is a valid GPU virtual address).

Per my calculation above, with a 512M granule, we will end up with a huge number of sub-regions...

> 
> If you are doing sub ranges then you'd probably allocate the IOVA for
> the well defined sub range (assuming the typical use case isn't huge)

Can you explain what the sub-ranges are? Does that mean the device only mirrors part of the CPU virtual address space?

How do we decide which part to mirror?


> 
> > But there will be a true difficulty to apply your scheme to this use
> > case. It is related to the STICKY flag. As I understand it, the
> > sticky flag is designed for driver to mark "this page/pfn has been
> > populated, no need to re-populate again", roughly...Unlike userptr
> > and RDMA use cases where the backing store of a buffer is always in
> > system memory, in the system allocator use case, the backing store
> > can be changing b/t system memory and GPU's device private
> > memory. Even worse, we have to assume the data migration b/t system
> > and GPU is dynamic. When data is migrated to GPU, we don't need
> > dma-map. And when migration happens to a pfn with STICKY flag, we
> > still need to repopulate this pfn. So you can see, it is not easy to
> > apply this scheme to this use case. At least I can't see an obvious
> > way.
> 
> You are already doing this today, you are keeping the sg list around
> until you unmap it.
> 
> Instead of keeping the sg list you'd keep a much smaller data structure
> per-granule. The sticky bit is simply a convenient way for ODP to manage
> the smaller data structure, you don't have to use it.
> 
> But you do need to keep track of which pages in the granule have been
> DMA mapped - the sg list was doing this before. This could be a simple
> bitmap array matching the granule size.

Makes sense. We can try it once your API is ready.
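
(For concreteness, the per-granule tracker Jason describes could be as
small as the following - a hypothetical sketch, names not from the
series:)

	/*
	 * One tracker per 512M granule: the IOVA base plus a bitmap with
	 * one bit per 4K page, marking which pages are currently
	 * DMA-linked.
	 */
	#define GRANULE_SHIFT	29	/* 512M */
	#define GRANULE_PAGES	(1UL << (GRANULE_SHIFT - PAGE_SHIFT))

	struct granule_dma_state {
		dma_addr_t iova_base;	/* from dma_alloc_iova() */
		DECLARE_BITMAP(mapped, GRANULE_PAGES);
	};

	static bool granule_page_mapped(struct granule_dma_state *g,
					unsigned long idx)
	{
		return test_bit(idx, g->mapped);
	}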

I still haven't fully figured out the granule scheme. Please help with the questions above.

Thanks,
Oak


> 
> Looking (far) forward we may be able to have a "replace" API that
> allows installing a new page unconditionally regardless of what is
> already there.
> 
> Jason
Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Posted by Zhu Yanjun 1 year, 7 months ago
On 03.05.24 01:32, Zeng, Oak wrote:
> Hi Leon, Jason
>
>> -----Original Message-----
>> From: Leon Romanovsky <leon@kernel.org>
>> Sent: Tuesday, March 5, 2024 6:19 AM
>> Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two
>> steps
>>
>> This is complimentary part to the proposed LSF/MM topic.
>> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-
>> b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>>
>> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO
>> and
>> DMA patches are ready for review and inclusion, the NVMe patches are still
>> in
>> progress as they require agreement on API first.
>>
>> Thanks
>>
>> -------------------------------------------------------------------------------
>> The DMA mapping operation performs two steps at one same time: allocates
>> IOVA space and actually maps DMA pages to that space. This one shot
>> operation works perfectly for non-complex scenarios, where callers use
>> that DMA API in control path when they setup hardware.
>>
>> However in more complex scenarios, when DMA mapping is needed in data
>> path and especially when some sort of specific datatype is involved,
>> such one shot approach has its drawbacks.
>>
>> That approach pushes developers to introduce new DMA APIs for specific
>> datatype. For example existing scatter-gather mapping functions, or
>> latest Chuck's RFC series to add biovec related DMA mapping [1] and
>> probably struct folio will need it too.
>>
>> These advanced DMA mapping APIs are needed to calculate IOVA size to
>> allocate it as one chunk and some sort of offset calculations to know
>> which part of IOVA to map.
>>
>> Instead of teaching DMA to know these specific datatypes, let's separate
>> existing DMA mapping routine to two steps and give an option to advanced
>> callers (subsystems) perform all calculations internally in advance and
>> map pages later when it is needed.
> I looked into how this scheme can be applied to DRM subsystem and GPU drivers.
>
> I figured RDMA can apply this scheme because RDMA can calculate the iova size. Per my limited knowledge of rdma, user can register a memory region (the reg_user_mr vfunc) and memory region's sized is used to pre-allocate iova space. And in the RDMA use case, it seems the user registered region can be very big, e.g., 512MiB or even GiB
>
> In GPU driver, we have a few use cases where we need dma-mapping. Just name two:
>
> 1) userptr: it is user malloc'ed/mmap'ed memory registered to the gpu (in Intel's driver it is through a vm_bind api, similar to mmap). A userptr can be of any random size, depending on the user's malloc size. Today we use dma-map-sg for this use case. The downside of our approach is that during userptr invalidation, even if the user munmaps only part of a userptr, we invalidate the whole userptr from the gpu page table, because there is no way for us to partially dma-unmap the whole sg list. I think we can try your new API in this case. The main benefit of the new approach is the partial munmap case.
>
> We will have to pre-allocate an iova range for each userptr, and we have many userptrs of random size... So we might not be as efficient as the RDMA case, where I assume the user registers a few big memory regions.
>
> 2) system allocator: it is malloc'ed/mmap'ed memory used by the GPU program directly, without any other extra driver API call. We call this use case the system allocator.
>
> For the system allocator, the driver has no knowledge of which virtual address range is valid in advance. So when the GPU accesses a malloc'ed/mmap'ed address, we get a page fault. We then look up the CPU vma which contains the faulting address. I guess we can use the CPU vma size to allocate an iova space of the same size?
>
> But there will be a true difficulty to apply your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for driver to mark "this page/pfn has been populated, no need to re-populate again", roughly...Unlike userptr and RDMA use cases where the backing store of a buffer is always in system memory, in the system allocator use case, the backing store can be changing b/t system memory and GPU's device private memory. Even worse, we have to assume the data migration b/t system and GPU is dynamic. When data is migrated to GPU, we don't need dma-map. And when migration happens to a pfn with STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way.

I am not sure whether GPU peer-to-peer dma mapping of GPU memory can use
this scheme. If I remember correctly, the Intel Gaudi GPU supports
peer-to-peer dma mapping in GPU Direct RDMA, and I am not sure whether
this scheme can be applied there.

Just my 2 cents.

Zhu Yanjun

>
>
> Oak
>
>
>> In this series, three users are converted, and each conversion
>> presents a different positive gain:
>> 1. RDMA simplifies and speeds up its pagefault handling for
>>     on-demand-paging (ODP) mode.
>> 2. VFIO PCI live migration code saves huge chunk of memory.
>> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>>     directly on BIOs.
>>
>> Thanks
>>
>> [1]
>> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@
>> klimt.1015granger.net
>>
>> Chaitanya Kulkarni (2):
>>    block: add dma_link_range() based API
>>    nvme-pci: use blk_rq_dma_map() for NVMe SGL
>>
>> Leon Romanovsky (14):
>>    mm/hmm: let users to tag specific PFNs
>>    dma-mapping: provide an interface to allocate IOVA
>>    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>>    iommu/dma: Provide an interface to allow preallocate IOVA
>>    iommu/dma: Prepare map/unmap page functions to receive IOVA
>>    iommu/dma: Implement link/unlink page callbacks
>>    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>>    RDMA/umem: Store ODP access mask information in PFN
>>    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>>    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>>    vfio/mlx5: Explicitly store page list
>>    vfio/mlx5: Convert vfio to use DMA link API
>>
>>   Documentation/core-api/dma-attributes.rst |   7 +
>>   block/blk-merge.c                         | 156 ++++++++++++++
>>   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>>   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>>   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>>   drivers/nvme/host/pci.c                   | 220 +++++--------------
>>   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>>   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>>   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>>   include/linux/blk-mq.h                    |   9 +
>>   include/linux/dma-map-ops.h               |  13 ++
>>   include/linux/dma-mapping.h               |  39 ++++
>>   include/linux/hmm.h                       |   3 +
>>   include/rdma/ib_umem_odp.h                |  22 +-
>>   include/rdma/ib_verbs.h                   |  54 +++++
>>   kernel/dma/debug.h                        |   2 +
>>   kernel/dma/direct.h                       |   7 +-
>>   kernel/dma/mapping.c                      |  91 ++++++++
>>   mm/hmm.c                                  |  34 +--
>>   20 files changed, 870 insertions(+), 605 deletions(-)
>>
>> --
>> 2.44.0

-- 
Best Regards,
Yanjun.Zhu