[PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*

Posted by Barry Song 1 week, 5 days ago
From: Barry Song <baohua@kernel.org>

Instead of performing a flush per SG entry, issue all cache
operations first and then flush once. This ultimately benefits
__dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 kernel/dma/direct.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index a219911c7b90..98bacf562ca1 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -402,12 +402,12 @@ void dma_direct_sync_sg_for_device(struct device *dev,
 
 		swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
 
-		if (!dev_is_dma_coherent(dev)) {
+		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_device(paddr, sg->length,
 					dir);
-			arch_sync_dma_flush();
-		}
 	}
+	if (!dev_is_dma_coherent(dev))
+		arch_sync_dma_flush();
 }
 #endif
 
@@ -423,10 +423,8 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 	for_each_sg(sgl, sg, nents, i) {
 		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-		if (!dev_is_dma_coherent(dev)) {
+		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_cpu(paddr, sg->length, dir);
-			arch_sync_dma_flush();
-		}
 
 		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
 
@@ -434,8 +432,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 			arch_dma_mark_clean(paddr, sg->length);
 	}
 
-	if (!dev_is_dma_coherent(dev))
+	if (!dev_is_dma_coherent(dev)) {
+		arch_sync_dma_flush();
 		arch_sync_dma_for_cpu_all();
+	}
 }
 
 /*
-- 
2.43.0
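
For context, the batching relies on the split made earlier in the series:
arch_sync_dma_for_device()/arch_sync_dma_for_cpu() issue the per-range
cache maintenance without waiting for it to complete, and
arch_sync_dma_flush() provides the single completion barrier (roughly a
DSB on arm64, per the discussion below). A minimal sketch of the shape of
the for_device path after this patch, leaving the SWIOTLB call aside; it
is not the literal series code and the function name is made up:

static void batched_sync_sg_for_device(struct device *dev,
                struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

                /* queue the cache maintenance, no barrier per entry */
                if (!dev_is_dma_coherent(dev))
                        arch_sync_dma_for_device(paddr, sg->length, dir);
        }

        /* one completion barrier for the whole scatterlist */
        if (!dev_is_dma_coherent(dev))
                arch_sync_dma_flush();
}
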
Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Leon Romanovsky 1 week, 4 days ago
On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
> 
> Instead of performing a flush per SG entry, issue all cache
> operations first and then flush once. This ultimately benefits
> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> 
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
> ---
>  kernel/dma/direct.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)

<...>

> -		if (!dev_is_dma_coherent(dev)) {
> +		if (!dev_is_dma_coherent(dev))
>  			arch_sync_dma_for_device(paddr, sg->length,
>  					dir);
> -			arch_sync_dma_flush();
> -		}
>  	}
> +	if (!dev_is_dma_coherent(dev))
> +		arch_sync_dma_flush();

This patch should be squashed into the previous one. You introduced
arch_sync_dma_flush() there, and now you are placing it elsewhere.

Thanks
Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Barry Song 1 week, 4 days ago
On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > From: Barry Song <baohua@kernel.org>
> >
> > Instead of performing a flush per SG entry, issue all cache
> > operations first and then flush once. This ultimately benefits
> > __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> >
> > Cc: Leon Romanovsky <leon@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > Signed-off-by: Barry Song <baohua@kernel.org>
> > ---
> >  kernel/dma/direct.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
>
> <...>
>
> > -             if (!dev_is_dma_coherent(dev)) {
> > +             if (!dev_is_dma_coherent(dev))
> >                       arch_sync_dma_for_device(paddr, sg->length,
> >                                       dir);
> > -                     arch_sync_dma_flush();
> > -             }
> >       }
> > +     if (!dev_is_dma_coherent(dev))
> > +             arch_sync_dma_flush();
>
> This patch should be squashed into the previous one. You introduced
> arch_sync_dma_flush() there, and now you are placing it elsewhere.

Hi Leon,

The previous patch replaces all arch_sync_dma_for_* calls with
arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
functional change. The subsequent patches then implement the
actual batching. I feel this is a better approach for reviewing
each change independently. Otherwise, the previous patch would
be too large.

Thanks
Barry
Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Leon Romanovsky 1 week, 3 days ago
On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > > From: Barry Song <baohua@kernel.org>
> > >
> > > Instead of performing a flush per SG entry, issue all cache
> > > operations first and then flush once. This ultimately benefits
> > > __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> > >
> > > Cc: Leon Romanovsky <leon@kernel.org>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > > Cc: Ard Biesheuvel <ardb@kernel.org>
> > > Cc: Marc Zyngier <maz@kernel.org>
> > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > > Signed-off-by: Barry Song <baohua@kernel.org>
> > > ---
> > >  kernel/dma/direct.c | 14 +++++++-------
> > >  1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > <...>
> >
> > > -             if (!dev_is_dma_coherent(dev)) {
> > > +             if (!dev_is_dma_coherent(dev))
> > >                       arch_sync_dma_for_device(paddr, sg->length,
> > >                                       dir);
> > > -                     arch_sync_dma_flush();
> > > -             }
> > >       }
> > > +     if (!dev_is_dma_coherent(dev))
> > > +             arch_sync_dma_flush();
> >
> > This patch should be squashed into the previous one. You introduced
> > arch_sync_dma_flush() there, and now you are placing it elsewhere.
> 
> Hi Leon,
> 
> The previous patch replaces all arch_sync_dma_for_* calls with
> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> functional change. The subsequent patches then implement the
> actual batching. I feel this is a better approach for reviewing
> each change independently. Otherwise, the previous patch would
> be too large.

Don't worry about it. Your patches are small enough.

> 
> Thanks
> Barry

Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Barry Song 1 day, 16 hours ago
On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> > On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > > > From: Barry Song <baohua@kernel.org>
> > > >
> > > > Instead of performing a flush per SG entry, issue all cache
> > > > operations first and then flush once. This ultimately benefits
> > > > __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> > > >
> > > > Cc: Leon Romanovsky <leon@kernel.org>
> > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > Cc: Will Deacon <will@kernel.org>
> > > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > > > Cc: Ard Biesheuvel <ardb@kernel.org>
> > > > Cc: Marc Zyngier <maz@kernel.org>
> > > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > > > Signed-off-by: Barry Song <baohua@kernel.org>
> > > > ---
> > > >  kernel/dma/direct.c | 14 +++++++-------
> > > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > >
> > > <...>
> > >
> > > > -             if (!dev_is_dma_coherent(dev)) {
> > > > +             if (!dev_is_dma_coherent(dev))
> > > >                       arch_sync_dma_for_device(paddr, sg->length,
> > > >                                       dir);
> > > > -                     arch_sync_dma_flush();
> > > > -             }
> > > >       }
> > > > +     if (!dev_is_dma_coherent(dev))
> > > > +             arch_sync_dma_flush();
> > >
> > > This patch should be squashed into the previous one. You introduced
> > > arch_sync_dma_flush() there, and now you are placing it elsewhere.
> >
> > Hi Leon,
> >
> > The previous patch replaces all arch_sync_dma_for_* calls with
> > arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> > functional change. The subsequent patches then implement the
> > actual batching. I feel this is a better approach for reviewing
> > each change independently. Otherwise, the previous patch would
> > be too large.
>
> Don't worry about it. Your patches are small enough.

My hardware does not require a bounce buffer, but I am concerned that
this patch may be incorrect for systems that do require one.

Now it is:

void dma_direct_sync_sg_for_cpu(struct device *dev,
                struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

                if (!dev_is_dma_coherent(dev))
                        arch_sync_dma_for_cpu(paddr, sg->length, dir);

                swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);

                if (dir == DMA_FROM_DEVICE)
                        arch_dma_mark_clean(paddr, sg->length);
        }

        if (!dev_is_dma_coherent(dev)) {
                arch_sync_dma_flush();
                arch_sync_dma_for_cpu_all();
        }
}

Should we call swiotlb_sync_single_for_cpu() and
arch_dma_mark_clean() after the flush to ensure the CPU sees the
latest data and that the memcpy is correct? I mean:

void dma_direct_sync_sg_for_cpu(struct device *dev,
                struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

                if (!dev_is_dma_coherent(dev))
                        arch_sync_dma_for_cpu(paddr, sg->length, dir);
        }

        if (!dev_is_dma_coherent(dev)) {
                arch_sync_dma_flush();
                arch_sync_dma_for_cpu_all();
        }

        for_each_sg(sgl, sg, nents, i) {
                phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

                swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);

                if (dir == DMA_FROM_DEVICE)
                        arch_dma_mark_clean(paddr, sg->length);
        }
}

Could this be the same issue for dma_direct_unmap_sg()?

Another option is to not support batched synchronization for the bounce
buffer case, since it is rare. In that case, it could be:

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 550a1a13148d..a4840f7e8722 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
        for_each_sg(sgl, sg, nents, i) {
                phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

-               if (!dev_is_dma_coherent(dev))
+               if (!dev_is_dma_coherent(dev)) {
                        arch_sync_dma_for_cpu(paddr, sg->length, dir);
+                       if (unlikely(dev->dma_io_tlb_mem))
+                               arch_sync_dma_flush();
+               }

                swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);

What’s your view on this, Leon?

Thanks
Barry
Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Robin Murphy 1 day, 16 hours ago
On 2026-01-06 6:41 pm, Barry Song wrote:
> On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon@kernel.org> wrote:
>>
>> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
>>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>
>>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
>>>>> From: Barry Song <baohua@kernel.org>
>>>>>
>>>>> Instead of performing a flush per SG entry, issue all cache
>>>>> operations first and then flush once. This ultimately benefits
>>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
>>>>>
>>>>> Cc: Leon Romanovsky <leon@kernel.org>
>>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>>> Cc: Will Deacon <will@kernel.org>
>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>>>> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
>>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
>>>>> Cc: Marc Zyngier <maz@kernel.org>
>>>>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Cc: Suren Baghdasaryan <surenb@google.com>
>>>>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
>>>>> Signed-off-by: Barry Song <baohua@kernel.org>
>>>>> ---
>>>>>   kernel/dma/direct.c | 14 +++++++-------
>>>>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>>>
>>>> <...>
>>>>
>>>>> -             if (!dev_is_dma_coherent(dev)) {
>>>>> +             if (!dev_is_dma_coherent(dev))
>>>>>                        arch_sync_dma_for_device(paddr, sg->length,
>>>>>                                        dir);
>>>>> -                     arch_sync_dma_flush();
>>>>> -             }
>>>>>        }
>>>>> +     if (!dev_is_dma_coherent(dev))
>>>>> +             arch_sync_dma_flush();
>>>>
>>>> This patch should be squashed into the previous one. You introduced
>>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
>>>
>>> Hi Leon,
>>>
>>> The previous patch replaces all arch_sync_dma_for_* calls with
>>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
>>> functional change. The subsequent patches then implement the
>>> actual batching. I feel this is a better approach for reviewing
>>> each change independently. Otherwise, the previous patch would
>>> be too large.
>>
>> Don't worry about it. Your patches are small enough.
> 
> My hardware does not require a bounce buffer, but I am concerned that
> this patch may be incorrect for systems that do require one.
> 
> Now it is:
> 
> void dma_direct_sync_sg_for_cpu(struct device *dev,
>                  struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> {
>          struct scatterlist *sg;
>          int i;
> 
>          for_each_sg(sgl, sg, nents, i) {
>                  phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> 
>                  if (!dev_is_dma_coherent(dev))
>                          arch_sync_dma_for_cpu(paddr, sg->length, dir);
> 
>                  swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> 
>                  if (dir == DMA_FROM_DEVICE)
>                          arch_dma_mark_clean(paddr, sg->length);
>          }
> 
>          if (!dev_is_dma_coherent(dev)) {
>                  arch_sync_dma_flush();
>                  arch_sync_dma_for_cpu_all();
>          }
> }
> 
> Should we call swiotlb_sync_single_for_cpu() and
> arch_dma_mark_clean() after the flush to ensure the CPU sees the
> latest data and that the memcpy is correct? I mean:

Yes, this and the equivalents in the later patches are broken for all 
the sync_for_cpu and unmap paths which may end up bouncing (beware some 
of them get a bit fiddly) - any cache maintenance *must* be completed 
before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing, 
and appears to be entirely dead now.

Thanks,
Robin.

> void dma_direct_sync_sg_for_cpu(struct device *dev,
>                  struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> {
>          struct scatterlist *sg;
>          int i;
> 
>          for_each_sg(sgl, sg, nents, i) {
>                  phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> 
>                  if (!dev_is_dma_coherent(dev))
>                          arch_sync_dma_for_cpu(paddr, sg->length, dir);
>          }
> 
>          if (!dev_is_dma_coherent(dev)) {
>                  arch_sync_dma_flush();
>                  arch_sync_dma_for_cpu_all();
>          }
> 
>          for_each_sg(sgl, sg, nents, i) {
>                  phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> 
>                  swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> 
>                  if (dir == DMA_FROM_DEVICE)
>                          arch_dma_mark_clean(paddr, sg->length);
>          }
> }
> 
> Could this be the same issue for dma_direct_unmap_sg()?
> 
> Another option is to not support batched synchronization for the bounce
> buffer case, since it is rare. In that case, it could be:
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 550a1a13148d..a4840f7e8722 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
>          for_each_sg(sgl, sg, nents, i) {
>                  phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> 
> -               if (!dev_is_dma_coherent(dev))
> +               if (!dev_is_dma_coherent(dev)) {
>                          arch_sync_dma_for_cpu(paddr, sg->length, dir);
> +                       if (unlikely(dev->dma_io_tlb_mem))
> +                               arch_sync_dma_flush();
> +               }
> 
>                  swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> 
> What’s your view on this, Leon?
> 
> Thanks
> Barry
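
The constraint Robin is pointing at: for a bounced DMA_FROM_DEVICE entry,
swiotlb_sync_single_for_cpu() memcpy's the device data from the bounce
slot back into the original buffer, so the invalidation of that buffer
(and its completion barrier) must have finished first, otherwise a still
in-flight invalidate can discard the freshly copied data. Purely as an
illustration of the required per-entry ordering, using the names from the
thread rather than a drop-in fix:

        if (!dev_is_dma_coherent(dev)) {
                arch_sync_dma_for_cpu(paddr, sg->length, dir);
                /* the maintenance must be complete ... */
                arch_sync_dma_flush();
        }
        /* ... before the bounce-buffer copy back into paddr */
        swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);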


Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Barry Song 1 day, 15 hours ago
On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2026-01-06 6:41 pm, Barry Song wrote:
> > On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon@kernel.org> wrote:
> >>
> >> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> >>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
> >>>>
> >>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> >>>>> From: Barry Song <baohua@kernel.org>
> >>>>>
> >>>>> Instead of performing a flush per SG entry, issue all cache
> >>>>> operations first and then flush once. This ultimately benefits
> >>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> >>>>>
> >>>>> Cc: Leon Romanovsky <leon@kernel.org>
> >>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
> >>>>> Cc: Will Deacon <will@kernel.org>
> >>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> >>>>> Cc: Robin Murphy <robin.murphy@arm.com>
> >>>>> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> >>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
> >>>>> Cc: Marc Zyngier <maz@kernel.org>
> >>>>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> >>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >>>>> Cc: Suren Baghdasaryan <surenb@google.com>
> >>>>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> >>>>> Signed-off-by: Barry Song <baohua@kernel.org>
> >>>>> ---
> >>>>>   kernel/dma/direct.c | 14 +++++++-------
> >>>>>   1 file changed, 7 insertions(+), 7 deletions(-)
> >>>>
> >>>> <...>
> >>>>
> >>>>> -             if (!dev_is_dma_coherent(dev)) {
> >>>>> +             if (!dev_is_dma_coherent(dev))
> >>>>>                        arch_sync_dma_for_device(paddr, sg->length,
> >>>>>                                        dir);
> >>>>> -                     arch_sync_dma_flush();
> >>>>> -             }
> >>>>>        }
> >>>>> +     if (!dev_is_dma_coherent(dev))
> >>>>> +             arch_sync_dma_flush();
> >>>>
> >>>> This patch should be squashed into the previous one. You introduced
> >>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
> >>>
> >>> Hi Leon,
> >>>
> >>> The previous patch replaces all arch_sync_dma_for_* calls with
> >>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> >>> functional change. The subsequent patches then implement the
> >>> actual batching. I feel this is a better approach for reviewing
> >>> each change independently. Otherwise, the previous patch would
> >>> be too large.
> >>
> >> Don't worry about it. Your patches are small enough.
> >
> > My hardware does not require a bounce buffer, but I am concerned that
> > this patch may be incorrect for systems that do require one.
> >
> > Now it is:
> >
> > void dma_direct_sync_sg_for_cpu(struct device *dev,
> >                  struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> > {
> >          struct scatterlist *sg;
> >          int i;
> >
> >          for_each_sg(sgl, sg, nents, i) {
> >                  phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> >
> >                  if (!dev_is_dma_coherent(dev))
> >                          arch_sync_dma_for_cpu(paddr, sg->length, dir);
> >
> >                  swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> >
> >                  if (dir == DMA_FROM_DEVICE)
> >                          arch_dma_mark_clean(paddr, sg->length);
> >          }
> >
> >          if (!dev_is_dma_coherent(dev)) {
> >                  arch_sync_dma_flush();
> >                  arch_sync_dma_for_cpu_all();
> >          }
> > }
> >
> > Should we call swiotlb_sync_single_for_cpu() and
> > arch_dma_mark_clean() after the flush to ensure the CPU sees the
> > latest data and that the memcpy is correct? I mean:
>
> Yes, this and the equivalents in the later patches are broken for all
> the sync_for_cpu and unmap paths which may end up bouncing (beware some
> of them get a bit fiddly) - any cache maintenance *must* be completed
> before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
> and appears to be entirely dead now.

Thanks, Robin. Personally, I would prefer an approach like the one below—
that is, not optimizing the bounce buffer cases, as they are already slow
due to hardware limitations with memcpy, and optimizing them would make
the code quite messy.

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 550a1a13148d..a4840f7e8722 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
        for_each_sg(sgl, sg, nents, i) {
                phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

-               if (!dev_is_dma_coherent(dev))
+               if (!dev_is_dma_coherent(dev)) {
                        arch_sync_dma_for_cpu(paddr, sg->length, dir);
+                       if (unlikely(dev->dma_io_tlb_mem))
+                               arch_sync_dma_flush();
+               }

                swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);

I’d like to check with you, Leon, and Marek on your views about this.

Thanks
Barry
Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Robin Murphy 22 hours ago
On 2026-01-06 7:47 pm, Barry Song wrote:
> On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2026-01-06 6:41 pm, Barry Song wrote:
>>> On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>
>>>> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
>>>>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>>>
>>>>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
>>>>>>> From: Barry Song <baohua@kernel.org>
>>>>>>>
>>>>>>> Instead of performing a flush per SG entry, issue all cache
>>>>>>> operations first and then flush once. This ultimately benefits
>>>>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
>>>>>>>
>>>>>>> Cc: Leon Romanovsky <leon@kernel.org>
>>>>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>>>>> Cc: Will Deacon <will@kernel.org>
>>>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>>>>>> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
>>>>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
>>>>>>> Cc: Marc Zyngier <maz@kernel.org>
>>>>>>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> Cc: Suren Baghdasaryan <surenb@google.com>
>>>>>>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
>>>>>>> Signed-off-by: Barry Song <baohua@kernel.org>
>>>>>>> ---
>>>>>>>    kernel/dma/direct.c | 14 +++++++-------
>>>>>>>    1 file changed, 7 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> <...>
>>>>>>
>>>>>>> -             if (!dev_is_dma_coherent(dev)) {
>>>>>>> +             if (!dev_is_dma_coherent(dev))
>>>>>>>                         arch_sync_dma_for_device(paddr, sg->length,
>>>>>>>                                         dir);
>>>>>>> -                     arch_sync_dma_flush();
>>>>>>> -             }
>>>>>>>         }
>>>>>>> +     if (!dev_is_dma_coherent(dev))
>>>>>>> +             arch_sync_dma_flush();
>>>>>>
>>>>>> This patch should be squashed into the previous one. You introduced
>>>>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
>>>>>
>>>>> Hi Leon,
>>>>>
>>>>> The previous patch replaces all arch_sync_dma_for_* calls with
>>>>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
>>>>> functional change. The subsequent patches then implement the
>>>>> actual batching. I feel this is a better approach for reviewing
>>>>> each change independently. Otherwise, the previous patch would
>>>>> be too large.
>>>>
>>>> Don't worry about it. Your patches are small enough.
>>>
>>> My hardware does not require a bounce buffer, but I am concerned that
>>> this patch may be incorrect for systems that do require one.
>>>
>>> Now it is:
>>>
>>> void dma_direct_sync_sg_for_cpu(struct device *dev,
>>>                   struct scatterlist *sgl, int nents, enum dma_data_direction dir)
>>> {
>>>           struct scatterlist *sg;
>>>           int i;
>>>
>>>           for_each_sg(sgl, sg, nents, i) {
>>>                   phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
>>>
>>>                   if (!dev_is_dma_coherent(dev))
>>>                           arch_sync_dma_for_cpu(paddr, sg->length, dir);
>>>
>>>                   swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
>>>
>>>                   if (dir == DMA_FROM_DEVICE)
>>>                           arch_dma_mark_clean(paddr, sg->length);
>>>           }
>>>
>>>           if (!dev_is_dma_coherent(dev)) {
>>>                   arch_sync_dma_flush();
>>>                   arch_sync_dma_for_cpu_all();
>>>           }
>>> }
>>>
>>> Should we call swiotlb_sync_single_for_cpu() and
>>> arch_dma_mark_clean() after the flush to ensure the CPU sees the
>>> latest data and that the memcpy is correct? I mean:
>>
>> Yes, this and the equivalents in the later patches are broken for all
>> the sync_for_cpu and unmap paths which may end up bouncing (beware some
>> of them get a bit fiddly) - any cache maintenance *must* be completed
>> before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
>> and appears to be entirely dead now.
> 
> Thanks, Robin. Personally, I would prefer an approach like the one below—
> that is, not optimizing the bounce buffer cases, as they are already slow
> due to hardware limitations with memcpy, and optimizing them would make
> the code quite messy.
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 550a1a13148d..a4840f7e8722 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
>          for_each_sg(sgl, sg, nents, i) {
>                  phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> 
> -               if (!dev_is_dma_coherent(dev))
> +               if (!dev_is_dma_coherent(dev)) {
>                          arch_sync_dma_for_cpu(paddr, sg->length, dir);
> +                       if (unlikely(dev->dma_io_tlb_mem))
> +                               arch_sync_dma_flush();
> +               }
> 
>                  swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> 
> I’d like to check with you, Leon, and Marek on your views about this.

That doesn't work, since dma_io_tlb_mem is always initialised if a 
SWIOTLB buffer exists at all. Similarly I think the existing 
dma_need_sync tracking is also too coarse, as that's also always going 
to be true for a non-coherent device.

Really this flush wants to be after the swiotlb_find_pool() check in the 
swiotlb_tbl_unmap_single()/__swiotlb_sync_single_for_cpu() paths, as 
that's the only point we know for sure it's definitely needed for the 
given address. It would then be rather fiddly to avoid 
potentially-redundant flushes for the non-sg cases (and the final 
segment of an sg), but as you already mentioned, if it's limited to 
cases when we *are* already paying the cost of bouncing anyway, perhaps 
one extra DSB isn't *too* bad if it means zero impact to the 
non-bouncing paths.

Thanks,
Robin.
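
To make that concrete: the per-address swiotlb_find_pool() lookup is the
first point at which bouncing is certain, so the barrier could be issued
there, just before the bounce copy. A rough sketch of that shape (the
swiotlb_find_pool()/__swiotlb_sync_single_for_cpu() split follows recent
trees; the exact signatures would need checking against the target tree):

        struct io_tlb_pool *pool = swiotlb_find_pool(dev, paddr);

        if (pool) {
                /*
                 * Bouncing is now known to happen: complete any batched
                 * cache maintenance before the bounce-buffer memcpy.
                 */
                if (!dev_is_dma_coherent(dev))
                        arch_sync_dma_flush();
                __swiotlb_sync_single_for_cpu(dev, paddr, size, dir, pool);
        }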

Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Posted by Leon Romanovsky 1 day, 3 hours ago
On Wed, Jan 07, 2026 at 08:47:36AM +1300, Barry Song wrote:
> On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 2026-01-06 6:41 pm, Barry Song wrote:
> > > On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >>
> > >> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> > >>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >>>>
> > >>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > >>>>> From: Barry Song <baohua@kernel.org>
> > >>>>>
> > >>>>> Instead of performing a flush per SG entry, issue all cache
> > >>>>> operations first and then flush once. This ultimately benefits
> > >>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> > >>>>>
> > >>>>> Cc: Leon Romanovsky <leon@kernel.org>
> > >>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
> > >>>>> Cc: Will Deacon <will@kernel.org>
> > >>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > >>>>> Cc: Robin Murphy <robin.murphy@arm.com>
> > >>>>> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > >>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
> > >>>>> Cc: Marc Zyngier <maz@kernel.org>
> > >>>>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > >>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> > >>>>> Cc: Suren Baghdasaryan <surenb@google.com>
> > >>>>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > >>>>> Signed-off-by: Barry Song <baohua@kernel.org>
> > >>>>> ---
> > >>>>>   kernel/dma/direct.c | 14 +++++++-------
> > >>>>>   1 file changed, 7 insertions(+), 7 deletions(-)
> > >>>>
> > >>>> <...>
> > >>>>
> > >>>>> -             if (!dev_is_dma_coherent(dev)) {
> > >>>>> +             if (!dev_is_dma_coherent(dev))
> > >>>>>                        arch_sync_dma_for_device(paddr, sg->length,
> > >>>>>                                        dir);
> > >>>>> -                     arch_sync_dma_flush();
> > >>>>> -             }
> > >>>>>        }
> > >>>>> +     if (!dev_is_dma_coherent(dev))
> > >>>>> +             arch_sync_dma_flush();
> > >>>>
> > >>>> This patch should be squashed into the previous one. You introduced
> > >>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
> > >>>
> > >>> Hi Leon,
> > >>>
> > >>> The previous patch replaces all arch_sync_dma_for_* calls with
> > >>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> > >>> functional change. The subsequent patches then implement the
> > >>> actual batching. I feel this is a better approach for reviewing
> > >>> each change independently. Otherwise, the previous patch would
> > >>> be too large.
> > >>
> > >> Don't worry about it. Your patches are small enough.
> > >
> > > My hardware does not require a bounce buffer, but I am concerned that
> > > this patch may be incorrect for systems that do require one.
> > >
> > > Now it is:
> > >
> > > void dma_direct_sync_sg_for_cpu(struct device *dev,
> > >                  struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> > > {
> > >          struct scatterlist *sg;
> > >          int i;
> > >
> > >          for_each_sg(sgl, sg, nents, i) {
> > >                  phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> > >
> > >                  if (!dev_is_dma_coherent(dev))
> > >                          arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > >
> > >                  swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> > >
> > >                  if (dir == DMA_FROM_DEVICE)
> > >                          arch_dma_mark_clean(paddr, sg->length);
> > >          }
> > >
> > >          if (!dev_is_dma_coherent(dev)) {
> > >                  arch_sync_dma_flush();
> > >                  arch_sync_dma_for_cpu_all();
> > >          }
> > > }
> > >
> > > Should we call swiotlb_sync_single_for_cpu() and
> > > arch_dma_mark_clean() after the flush to ensure the CPU sees the
> > > latest data and that the memcpy is correct? I mean:
> >
> > Yes, this and the equivalents in the later patches are broken for all
> > the sync_for_cpu and unmap paths which may end up bouncing (beware some
> > of them get a bit fiddly) - any cache maintenance *must* be completed
> > before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
> > and appears to be entirely dead now.
> 
> Thanks, Robin. Personally, I would prefer an approach like the one below—
> that is, not optimizing the bounce buffer cases, as they are already slow
> due to hardware limitations with memcpy, and optimizing them would make
> the code quite messy.
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 550a1a13148d..a4840f7e8722 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
>         for_each_sg(sgl, sg, nents, i) {
>                 phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> 
> -               if (!dev_is_dma_coherent(dev))
> +               if (!dev_is_dma_coherent(dev)) {
>                         arch_sync_dma_for_cpu(paddr, sg->length, dir);
> +                       if (unlikely(dev->dma_io_tlb_mem))
> +                               arch_sync_dma_flush();
> +               }
> 
>                 swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> 
> I’d like to check with you, Leon, and Marek on your views about this.

I agree with your point that the non-SWIOTLB path is the performant one and
should be preferred. My concern is that you are accessing the
dma_io_tlb_mem variable directly from direct.c, which looks like a layer
violation.

You likely need to introduce an is_swiotlb_something() helper for this.

BTW, please send a v3 instead of posting incremental follow-ups.
It's hard to track the changes across multiple small additions.

Thanks.
> 
> Thanks
> Barry
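
For reference, a helper along the lines suggested above could be as small
as the sketch below; the name is made up for illustration. Per Robin's
earlier point, it only says that bouncing is possible for the device at
all (dev->dma_io_tlb_mem is set whenever a SWIOTLB buffer exists), not
that a given address was actually bounced, so the precise per-address
test remains swiotlb_find_pool():

/* Hypothetical helper, keeps direct.c from touching dev->dma_io_tlb_mem. */
static inline bool dev_may_use_swiotlb(struct device *dev)
{
#ifdef CONFIG_SWIOTLB
        /* coarse: true for every device once a SWIOTLB buffer exists */
        return !!dev->dma_io_tlb_mem;
#else
        return false;
#endif
}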