[RFC] dma-buf: system_heap: add PTE_CONT for larger contiguous
Posted by gao xu 1 week, 4 days ago
commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings
instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap
allocates pages of order 4 and 8 that meet the alignment requirements for
PTE_CONT, enabling PTE_CONT for larger contiguous mappings.

After applying this patch, TLB misses are reduced by approximately 5% when
opening the camera on Android systems.

Signed-off-by: gao xu <gaoxu2@honor.com>
---
 drivers/dma-buf/heaps/system_heap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 4c782fe33..103b06f89 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -202,12 +202,16 @@ static int system_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
 		unsigned long n = (sg->length >> PAGE_SHIFT) - pgoff;
 		struct page *page = sg_page(sg) + pgoff;
 		unsigned long size = n << PAGE_SHIFT;
+		pgprot_t prot = vma->vm_page_prot;
 
 		if (addr + size > vma->vm_end)
 			size = vma->vm_end - addr;
 
+		if (((addr | size) & ~CONT_PTE_MASK) == 0)
+			prot = __pgprot(pgprot_val(prot) | PTE_CONT);
+
 		ret = remap_pfn_range(vma, addr, page_to_pfn(page),
-				size, vma->vm_page_prot);
+				size, prot);
 		if (ret)
 			return ret;
 
-- 
2.42.0
Re: [RFC] dma-buf: system_heap: add PTE_CONT for larger contiguous
Posted by Barry Song 1 week, 3 days ago
On Mon, Dec 8, 2025 at 5:41 PM gao xu <gaoxu2@honor.com> wrote:
>
> commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings
> instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap
> allocates pages of order 4 and 8 that meet the alignment requirements for
> PTE_CONT, enabling PTE_CONT for larger contiguous mappings.

Unfortunately, we don't have pte_cont for architectures other than
AArch64. On the other hand, AArch64 isn't automatically mapping
cont_pte for mmap. It might be better if this were done
automatically by the ARM code.

Ryan(Cced) is the expert on automatically setting cont_pte for
contiguous mapping, so let's ask for some advice from Ryan.

>
> After applying this patch, TLB misses are reduced by approximately 5% when
> opening the camera on Android systems.
>
> Signed-off-by: gao xu <gaoxu2@honor.com>
> ---

Thanks
Barry
Re: [RFC] dma-buf: system_heap: add PTE_CONT for larger contiguous
Posted by Ryan Roberts 1 week, 3 days ago
On 08/12/2025 09:52, Barry Song wrote:
> On Mon, Dec 8, 2025 at 5:41 PM gao xu <gaoxu2@honor.com> wrote:
>>
>> commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings
>> instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap
>> allocates pages of order 4 and 8 that meet the alignment requirements for
>> PTE_CONT, enabling PTE_CONT for larger contiguous mappings.
> 
> Unfortunately, we don't have pte_cont for architectures other than
> AArch64. On the other hand, AArch64 isn't automatically mapping
> cont_pte for mmap. It might be better if this were done
> automatically by the ARM code.

Yes indeed; CONT_PTE_MASK and PTE_CONT are arm64-specific macros that cannot be
used outside of the arm64 arch code.

> 
> Ryan(Cced) is the expert on automatically setting cont_pte for
> contiguous mapping, so let's ask for some advice from Ryan.

arm64 arch code will automatically and transparently apply PTE_CONT whenever it
detects suitable conditions. Those suitable conditions include:

  - physically contiguous block of 64K, aligned to 64K
  - virtually contiguous block of 64K, aligned to 64K
  - 64K block has the same access permissions
  - 64K block all belongs to the same folio
  - not a special mapping

The last 2 requirements are the tricky ones here: We require that every page in
the block belongs to the same folio because a contiguous mapping only maintains a
single access and dirty bit for the whole 64K block, so we are losing fidelity
vs per-page mappings. But the kernel tracks access/dirty per folio, so the extra
fidelity we get for per-page mappings is ignored by the kernel anyway if the
contiguous mapping only maps pages from a single folio. We reject special
mappings because they are not backed by a folio at all.

For your case, remap_pfn_range() will create special mappings so we will never
set the PTE_CONT bit.
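
For reference, remap_pfn_range() bottoms out in a per-PTE loop roughly like
this (simplified from remap_pte_range() in mm/memory.c; details vary by
kernel version):

	do {
		/* Each entry is written individually and marked special. */
		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
		pfn++;
	} while (pte++, addr += PAGE_SIZE, addr != end);

i.e. every entry in the mapping is a pfn-based special PTE with no folio
behind it.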

Likely we are being a bit too conservative here and we may be able to relax this
requirement if we know that nothing will ever consume the access/dirty
information for special mappings? I'm not sure if that is the case in general though
- it would need some investigation.

With that issue resolved, there is still a second issue; there are 2 ways the
arm64 arch code detects suitable contiguous mappings. The primary way is via a
call to set_ptes(). This is part of the "PTE batching" API and explicitly tells the
implementation that all the conditions are met (including the memory being
backed by a folio). This is the most efficient approach. See contpte_set_ptes().

There is a second (hacky) approach which attempts to recognise when the last PTE
of a contiguous block is set and automatically "fold" the mapping. See
contpte_try_fold(). This approach has a cost because (for systems without
BBML2_NOABORT) we have to issue a TLBI when we fold the range.
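
To illustrate the two paths, the arm64 set_ptes() wrapper dispatches roughly
like this (a simplified sketch of arch/arm64/include/asm/pgtable.h with
CONFIG_ARM64_CONTPTE enabled; not the exact source):

static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
				     pte_t *ptep, pte_t pte, unsigned int nr)
{
	pte = pte_mknoncont(pte);

	if (likely(nr == 1)) {
		/* Single-entry stores rely on the fold/unfold heuristic. */
		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
		__set_ptes(mm, addr, ptep, pte, 1);
		contpte_try_fold(mm, addr, ptep, pte);
	} else {
		/* A batched store lets the arch apply PTE_CONT up front. */
		contpte_set_ptes(mm, addr, ptep, pte, nr);
	}
}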

For remap_pfn_range(), we would be relying on the second approach since it is
not currently batched (and could not use set_ptes() as currently spec'ed due to
there being no folio). If we are going to add support for contiguous pfn-mapped
PTEs, it would be preferable to add equivalent batching APIs (or relax set_ptes()).

I think this would be a useful improvement, but it's not as straightforward as
adding PTE_CONT in system_heap_mmap().

Thanks,
Ryan

> 
>>
>> After applying this patch, TLB misses are reduced by approximately 5% when
>> opening the camera on Android systems.
>>
>> Signed-off-by: gao xu <gaoxu2@honor.com>
>> ---
> 
> Thanks
> Barry

Re: [RFC] dma-buf: system_heap: add PTE_CONT for larger contiguous
Posted by Barry Song 1 week, 2 days ago
On Mon, Dec 8, 2025 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 08/12/2025 09:52, Barry Song wrote:
> > On Mon, Dec 8, 2025 at 5:41 PM gao xu <gaoxu2@honor.com> wrote:
> >>
> >> commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings
> >> instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap
> >> allocates pages of order 4 and 8 that meet the alignment requirements for
> >> PTE_CONT, enabling PTE_CONT for larger contiguous mappings.
> >
> > Unfortunately, we don't have pte_cont for architectures other than
> > AArch64. On the other hand, AArch64 isn't automatically mapping
> > cont_pte for mmap. It might be better if this were done
> > automatically by the ARM code.
>
> Yes indeed; CONT_PTE_MASK and PTE_CONT are arm64-specific macros that cannot be
> used outside of the arm64 arch code.
>
> >
> > Ryan(Cced) is the expert on automatically setting cont_pte for
> > contiguous mapping, so let's ask for some advice from Ryan.
>
> arm64 arch code will automatically and transparently apply PTE_CONT whenever it
> detects suitable conditions. Those suitable conditions include:
>
>   - physically contiguous block of 64K, aligned to 64K
>   - virtually contiguous block of 64K, aligned to 64K
>   - 64K block has the same access permissions
>   - 64K block all belongs to the same folio
>   - not a special mapping
>
> The last 2 requirements are the tricky ones here: We require that every page in
> the block belongs to the same folio because a contiguous mapping only maintains a
> single access and dirty bit for the whole 64K block, so we are losing fidelity
> vs per-page mappings. But the kernel tracks access/dirty per folio, so the extra
> fidelity we get for per-page mappings is ignored by the kernel anyway if the
> contiguous mapping only maps pages from a single folio. We reject special
> mappings because they are not backed by a folio at all.
>
> For your case, remap_pfn_range() will create special mappings so we will never
> set the PTE_CONT bit.
>
> Likely we are being a bit too conservative here and we may be able to relax this
> requirement if we know that nothing will ever consume the access/dirty
> information for special mappings? I'm not sure if that is the case in general though
> - it would need some investigation.
>
> With that issue resolved, there is still a second issue; there are 2 ways the
> arm64 arch code detects suitable contiguous mappings. The primary way is via a
> call to set_ptes(). This is part of the "PTE batching" API and explicitly tells the
> implementation that all the conditions are met (including the memory being
> backed by a folio). This is the most efficient approach. See contpte_set_ptes().
>
> There is a second (hacky) approach which attempts to recognise when the last PTE
> of a contiguous block is set and automatically "fold" the mapping. See
> contpte_try_fold(). This approach has a cost because (for systems without
> BBML2_NOABORT) we have to issue a TLBI when we fold the range.
>
> For remap_pfn_range(), we would be relying on the second approach since it is
> not currently batched (and could not use set_ptes() as currently spec'ed due to
> there being no folio). If we are going to add support for contiguous pfn-mapped
> PTEs, it would be preferable to add equivalent batching APIs (or relax set_ptes()).
>

Thanks a lot, Ryan. It seems quite tricky to support automatic cont_pte.

> I think this would be a useful improvement, but it's not as straightforward as
> adding PTE_CONT in system_heap_mmap().

Since it's just a driver, I'm not sure if it's acceptable to use CONFIG_ARM64.
However, I can find many instances of it in drivers.
drivers % git grep CONFIG_ARM64 | wc -l
     127
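
i.e. the heap could keep the hint arm64-only with something like this
(untested sketch, just the same hunk behind an ifdef):

	pgprot_t prot = vma->vm_page_prot;

#ifdef CONFIG_ARM64
	/* CONT_PTE_MASK and PTE_CONT only exist on arm64. */
	if (((addr | size) & ~CONT_PTE_MASK) == 0)
		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
#endif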

On the other hand, a corner case is when the dma-buf is partially unmapped.
I assume cont_pte can still be automatically unfolded, even for
special mappings?

Thanks
Barry
Re: [RFC] dma-buf: system_heap: add PTE_CONT for larger contiguous
Posted by Ryan Roberts 1 week, 2 days ago
On 09/12/2025 11:37, Barry Song wrote:
> On Mon, Dec 8, 2025 at 6:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 08/12/2025 09:52, Barry Song wrote:
>>> On Mon, Dec 8, 2025 at 5:41 PM gao xu <gaoxu2@honor.com> wrote:
>>>>
>>>> commit 04c7adb5871a ("dma-buf: system_heap: use larger contiguous mappings
>>>> instead of per-page mmap") facilitates the use of PTE_CONT. The system_heap
>>>> allocates pages of order 4 and 8 that meet the alignment requirements for
>>>> PTE_CONT, enabling PTE_CONT for larger contiguous mappings.
>>>
>>> Unfortunately, we don't have pte_cont for architectures other than
>>> AArch64. On the other hand, AArch64 isn't automatically mapping
>>> cont_pte for mmap. It might be better if this were done
>>> automatically by the ARM code.
>>
>> Yes indeed; CONT_PTE_MASK and PTE_CONT are arm64-specific macros that cannot be
>> used outside of the arm64 arch code.
>>
>>>
>>> Ryan(Cced) is the expert on automatically setting cont_pte for
>>> contiguous mapping, so let's ask for some advice from Ryan.
>>
>> arm64 arch code will automatically and transparently apply PTE_CONT whenever it
>> detects suitable conditions. Those suitable conditions include:
>>
>>   - physically contiguous block of 64K, aligned to 64K
>>   - virtually contiguous block of 64K, aligned to 64K
>>   - 64K block has the same access permissions
>>   - 64K block all belongs to the same folio
>>   - not a special mapping
>>
>> The last 2 requirements are the tricky ones here: We require that every page in
>> the block belongs to the same folio because a contiguous mapping only maintains a
>> single access and dirty bit for the whole 64K block, so we are losing fidelity
>> vs per-page mappings. But the kernel tracks access/dirty per folio, so the extra
>> fidelity we get for per-page mappings is ignored by the kernel anyway if the
>> contiguous mapping only maps pages from a single folio. We reject special
>> mappings because they are not backed by a folio at all.
>>
>> For your case, remap_pfn_range() will create special mappings so we will never
>> set the PTE_CONT bit.
>>
>> Likely we are being a bit too conservative here and we may be able to relax this
>> requirement if we know that nothing will ever consume the access/dirty
>> information for special mappings? I'm not sure if that is the case in general though
>> - it would need some investigation.
>>
>> With that issue resolved, there is still a second issue; there are 2 ways the
>> arm64 arch code detects suitable contiguous mappings. The primary way is via a
>> call to set_ptes(). This is part of the "PTE batching" API and explicitly tells the
>> implementation that all the conditions are met (including the memory being
>> backed by a folio). This is the most efficient approach. See contpte_set_ptes().
>>
>> There is a second (hacky) approach which attempts to recognise when the last PTE
>> of a contiguous block is set and automatically "fold" the mapping. See
>> contpte_try_fold(). This approach has a cost because (for systems without
>> BBML2_NOABORT) we have to issue a TLBI when we fold the range.
>>
>> For remap_pfn_range(), we would be relying on the second approach since it is
>> not currently batched (and could not use set_ptes() as currently spec'ed due to
>> there being no folio). If we are going to add support for contiguous pfn-mapped
>> PTEs, it would be preferable to add equivalent batching APIs (or relax set_ptes()).
>>
> 
> Thanks a lot, Ryan. It seems quite tricky to support automatic cont_pte.
> 
>> I think this would be a useful improvement, but it's not as straightforward as
>> adding PTE_CONT in system_heap_mmap().
> 
> Since it's just a driver, I'm not sure if it's acceptable to use CONFIG_ARM64.
> However, I can find many instances of it in drivers.
> drivers % git grep CONFIG_ARM64 | wc -l
>      127
> 
> On the other hand, a corner case is when the dma-buf is partially unmapped.
> I assume cont_pte can still be automatically unfolded, even for
> special mappings?

I think unfolding will probably happen to work, but you're definitely in the
neighbourhood of "horrible hack that may not work as intended in some corner cases".

I think it would be much better to support batching for pfn-mapped ptes. That
would generalize to many more users. (and I might be interested in taking a look
at some point next year if nobody else gets to it).
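
Roughly, the shape could be something like the following (the name and
signature are invented purely for illustration; this is not an existing
kernel interface):

/*
 * Hypothetical batched store for pfn-mapped (special) PTEs: the caller
 * guarantees that addr/pte cover nr virtually and physically contiguous
 * pages, so the arch can pick a contiguous mapping size internally
 * without needing a folio.
 */
void set_pfn_ptes(struct mm_struct *mm, unsigned long addr,
		  pte_t *ptep, pte_t pte, unsigned int nr);

remap_pfn_range() could then issue one call per batch instead of one
set_pte_at() per page.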

We deliberately didn't want to expose the idea of a single, specific contiguous
size to the generic code so that the arch could make more fine-grained decisions. :)

Thanks,
Ryan



> 
> Thanks
> Barry