[PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2

Mikołaj Lenczewski posted 4 patches 11 months, 2 weeks ago
There is a newer version of this series
[PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2
Posted by Mikołaj Lenczewski 11 months, 2 weeks ago
If we support bbml2 without conflict aborts, we can avoid the final
flush and have hardware manage the tlb entries for us. Avoiding flushes
is a win.

Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/mm/contpte.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 145530f706a9..77ed03b30b72 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -72,9 +72,6 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
 		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
 
 	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
-
-	if (system_supports_bbml2_noabort())
-		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
 }
 
 void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
-- 
2.45.3
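
For reference, the flow of contpte_convert() once patches #2 and #3 are both
applied looks roughly as follows. This is a simplified sketch reconstructed
from the diff context above rather than the exact upstream code: the loop that
clears the old PTEs and folds their dirty/young bits into the new value is
elided, and the `!system_supports_bbml2_noabort()` guard on the remaining
flush is assumed to be what patch #2 (not shown here) introduced.

static void contpte_convert(struct mm_struct *mm, unsigned long addr,
			    pte_t *ptep, pte_t pte)
{
	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
	unsigned long start_addr;
	pte_t *start_ptep;

	/*
	 * ... align ptep/addr down to the contpte block, then clear each of
	 * the CONT_PTES entries with __ptep_get_and_clear(), accumulating
	 * their dirty/young bits into pte (omitted in this sketch) ...
	 */

	/*
	 * Without BBML2_NOABORT, the old per-page entries must be invalidated
	 * before the range is repainted with the contiguous bit, so the TLB
	 * never caches conflicting translations for the same address.
	 */
	if (!system_supports_bbml2_noabort())
		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);

	/* Write back the CONT_PTES entries with the contiguous bit set. */
	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);

	/*
	 * With BBML2_NOABORT there is no trailing flush: the hardware may hold
	 * both the old and the new entries for the range without raising a
	 * conflict abort, so any stale entries are left for it to manage.
	 */
}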

Re: [PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2
Posted by David Hildenbrand 11 months, 1 week ago
On 28.02.25 19:24, Mikołaj Lenczewski wrote:
> If we support bbml2 without conflict aborts, we can avoid the final
> flush and have hardware manage the tlb entries for us. Avoiding flushes
> is a win.
> 
> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/mm/contpte.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 145530f706a9..77ed03b30b72 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -72,9 +72,6 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>   		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>   
>   	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> -
> -	if (system_supports_bbml2_noabort())
> -		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>   }
>   
>   void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,

What's the point of not squashing this into #2? :)

If this split was requested during earlier review, at least seeing patch 
#2 on its own confused me.

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2
Posted by Mikołaj Lenczewski 11 months, 1 week ago
Hi David,

Thanks for taking the time to review.

On Mon, Mar 03, 2025 at 10:17:12AM +0100, David Hildenbrand wrote:
> On 28.02.25 19:24, Mikołaj Lenczewski wrote:
> > If we support bbml2 without conflict aborts, we can avoid the final
> > flush and have hardware manage the tlb entries for us. Avoiding flushes
> > is a win.
> > 
> > Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
> > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> >   arch/arm64/mm/contpte.c | 3 ---
> >   1 file changed, 3 deletions(-)
> > 
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index 145530f706a9..77ed03b30b72 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -72,9 +72,6 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> >   		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> >   	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> > -
> > -	if (system_supports_bbml2_noabort())
> > -		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> >   }
> >   void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> 
> What's the point of not squashing this into #2? :)
> 
> If this split was requested during earlier review, at least seeing patch #2
> on its own confused me.

This split is a holdover from an earlier patchset, where it was still
unknown whether the removal of the second flush was permitted with
BBML2. Partly this was due to us being worried about conflict aborts
after the removal, and partly this was because the "delay" is a separate
optimisation that we could apply even if it turned out the final patch
was not architecturally sound.

Now that we do not handle conflict aborts (preferring only systems that
handle BBML2 without ever raising aborts), the first issue is not a
problem. The reasoning behind the second patch is also a little bit
outdated, but I can see the logical split between a tlbi reorder, and
the removal of the tlbi. If this is truly redundant though, I would be
happy to squash the two into a single patch.

-- 
Kind regards,
Mikołaj Lenczewski
Re: [PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2
Posted by David Hildenbrand 11 months, 1 week ago
On 03.03.25 10:49, Mikołaj Lenczewski wrote:
> Hi David,
> 
> Thanks for taking the time to review.
> 
> On Mon, Mar 03, 2025 at 10:17:12AM +0100, David Hildenbrand wrote:
>> On 28.02.25 19:24, Mikołaj Lenczewski wrote:
>>> If we support bbml2 without conflict aborts, we can avoid the final
>>> flush and have hardware manage the tlb entries for us. Avoiding flushes
>>> is a win.
>>>
>>> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    arch/arm64/mm/contpte.c | 3 ---
>>>    1 file changed, 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index 145530f706a9..77ed03b30b72 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -72,9 +72,6 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>    		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>    	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>> -
>>> -	if (system_supports_bbml2_noabort())
>>> -		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>    }
>>>    void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>
>> What's the point of not squashing this into #2? :)
>>
>> If this split was requested during earlier review, at least seeing patch #2
>> on its own confused me.
> 
> This split is a holdover from an earlier patchset, where it was still
> unknown whether the removal of the second flush was permitted with
> BBML2. Partly this was due to us being worried about conflict aborts
> after the removal, and partly this was because the "delay" is a separate
> optimisation that we could apply even if it turned out the final patch
> was not architecturally sound.
> 
> Now that we do not handle conflict aborts (preferring only systems that
> handle BBML2 without ever raising aborts), the first issue is not a
> problem. The reasoning behind the second patch is also a little bit
> outdated, but I can see the logical split between a tlbi reorder, and
> the removal of the tlbi. If this is truly redundant though, I would be
> happy to squash the two into a single patch.

Thanks for the information.

Does patch #2 (reordering the tlbi) have any benefit on its own? I read 
"other threads will not see an invalid pagetable entry", but I am not 
sure that is correct. A concurrent HW page table walker would still find 
the invalid PTE? It's just a matter of TLB state.

If there is no benefit in having patch #2 independently, I'd just squash 
them. Reordering to then remove is more complicated than just removing 
it IMHO.

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2
Posted by Mikołaj Lenczewski 11 months, 1 week ago
On Mon, Mar 03, 2025 at 10:57:21AM +0100, David Hildenbrand wrote:
> On 03.03.25 10:49, Mikołaj Lenczewski wrote:
> > Hi David,
> > 
> > Thanks for taking the time to review.
> > 
> > On Mon, Mar 03, 2025 at 10:17:12AM +0100, David Hildenbrand wrote:
> > > On 28.02.25 19:24, Mikołaj Lenczewski wrote:
> > > > If we support bbml2 without conflict aborts, we can avoid the final
> > > > flush and have hardware manage the tlb entries for us. Avoiding flushes
> > > > is a win.
> > > > 
> > > > Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
> > > > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> > > > ---
> > > >    arch/arm64/mm/contpte.c | 3 ---
> > > >    1 file changed, 3 deletions(-)
> > > > 
> > > > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > > > index 145530f706a9..77ed03b30b72 100644
> > > > --- a/arch/arm64/mm/contpte.c
> > > > +++ b/arch/arm64/mm/contpte.c
> > > > @@ -72,9 +72,6 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> > > >    		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> > > >    	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> > > > -
> > > > -	if (system_supports_bbml2_noabort())
> > > > -		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> > > >    }
> > > >    void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> > > 
> > > What's the point of not squashing this into #2? :)
> > > 
> > > If this split was requested during earlier review, at least seeing patch #2
> > > on its own confused me.
> > 
> > This split is a holdover from an earlier patchset, where it was still
> > unknown whether the removal of the second flush was permitted with
> > BBML2. Partly this was due to us being worried about conflict aborts
> > after the removal, and partly this was because the "delay" is a separate
> > optimisation that we could apply even if it turned out the final patch
> > was not architecturally sound.
> > 
> > Now that we do not handle conflict aborts (preferring only systems that
> > handle BBML2 without ever raising aborts), the first issue is not a
> > problem. The reasoning behind the second patch is also a little bit
> > outdated, but I can see the logical split between a tlbi reorder, and
> > the removal of the tlbi. If this is truly redundant though, I would be
> > happy to squash the two into a single patch.
> 
> Thanks for the information.
> 
> Does patch #2 (reordering the tlbi) have any benefit on its own? I read
> "other threads will not see an invalid pagetable entry", but I am not sure
> that is correct. A concurrent HW page table walker would still find the
> invalid PTE? It's just a matter of TLB state.

I think I understand what you mean. I agree that it is possible for a
concurrent walk to see an invalid PTE, if it is on the same TLB
that the repaint is happening on. For other TLBs, the flush has not yet
propagated our invalidated PTEs (from `__ptep_get_and_clear()`) though?
That invalidation will only be seen by other TLBs after the
`__flush_tlb_range()`, so we should save a few faults because only
"local" threads will ever see the invalid entry, as opposed to all
threads that try to read our modified range? Or is it the case that I
have misunderstood something basic here, or that I have misinterpreted
what you have written?

> If there is no benefit in having patch #2 independently, I'd just squash
> them. Reordering to then remove is more complicated than just removing it
> IMHO.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

-- 
Kind regards,
Mikołaj Lenczewski
Re: [PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2
Posted by David Hildenbrand 11 months, 1 week ago
On 03.03.25 11:55, Mikołaj Lenczewski wrote:
> On Mon, Mar 03, 2025 at 10:57:21AM +0100, David Hildenbrand wrote:
>> On 03.03.25 10:49, Mikołaj Lenczewski wrote:
>>> Hi David,
>>>
>>> Thanks for taking the time to review.
>>>
>>> On Mon, Mar 03, 2025 at 10:17:12AM +0100, David Hildenbrand wrote:
>>>> On 28.02.25 19:24, Mikołaj Lenczewski wrote:
>>>>> If we support bbml2 without conflict aborts, we can avoid the final
>>>>> flush and have hardware manage the tlb entries for us. Avoiding flushes
>>>>> is a win.
>>>>>
>>>>> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>     arch/arm64/mm/contpte.c | 3 ---
>>>>>     1 file changed, 3 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>> index 145530f706a9..77ed03b30b72 100644
>>>>> --- a/arch/arm64/mm/contpte.c
>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>> @@ -72,9 +72,6 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>>>     		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>>>     	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>>>> -
>>>>> -	if (system_supports_bbml2_noabort())
>>>>> -		__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>>>>     }
>>>>>     void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
>>>>
>>>> What's the point of not squashing this into #2? :)
>>>>
>>>> If this split was requested during earlier review, at least seeing patch #2
>>>> on its own confused me.
>>>
>>> This split is a holdover from an earlier patchset, where it was still
>>> unknown whether the removal of the second flush was permitted with
>>> BBML2. Partly this was due to us being worried about conflict aborts
>>> after the removal, and partly this was because the "delay" is a separate
>>> optimisation that we could apply even if it turned out the final patch
>>> was not architecturally sound.
>>>
>>> Now that we do not handle conflict aborts (preferring only systems that
>>> handle BBML2 without ever raising aborts), the first issue is not a
>>> problem. The reasoning behind the second patch is also a little bit
>>> outdated, but I can see the logical split between a tlbi reorder, and
>>> the removal of the tlbi. If this is truly redundant though, I would be
>>> happy to squash the two into a single patch.
>>
>> Thanks for the information.
>>
>> Does patch #2 (reordering the tlbi) have any benefit on its own? I read
>> "other threads will not see an invalid pagetable entry", but I am not sure
>> that is correct. A concurrent HW page table walker would still find the
>> invalid PTE? It's just a matter of TLB state.
> 
> I think I understand what you mean. I agree that it is possible for a
> concurrent walk to see an invalid PTE, if it is on the same TLB
> that the repaint is happening on. For other TLBs, the flush has not yet
> propagated our invalidated PTEs (from `__ptep_get_and_clear()`) though?

What I am saying is: if there is no TLB entry yet, HW will walk the page 
table to find no present PTE and trigger a fault.

> That invalidation will only be seen by other TLBs after the
> `__flush_tlb_range()`, so we should save a few faults because only
> "local" threads will ever see the invalid entry, as opposed to all
> threads that try to read our modified range?

So what you are saying is that deferring the flush means that, if there 
is already a TLB entry, we reduce the likelihood that a page table walk 
is triggered that could find no present PTE, and consequently reduce the 
likelihood that a page fault is triggered.

(I use the word likelihood, because I assume other action could result 
in a TLB entry getting flushed in the meantime, such as TLB entry reuse)

Correct?

> Or is it the case that I
> have misunderstood something basic here, or that I have misinterpreted
> what you have written?

No, that makes it clearer, thanks.

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/4] arm64/mm: Elide tlbi in contpte_convert() under BBML2
Posted by Mikołaj Lenczewski 11 months, 1 week ago
> > I think I understand what you mean. I agree that it is possible for a
> > concurrent walk to see an invalid PTE, if it is on the same TLB
> > that the repaint is happening on. For other TLBs, the flush has not yet
> > propagated our invalidated PTEs (from `__ptep_get_and_clear()`) though?
> 
> What I am saying is: if there is no TLB entry yet, HW will walk the page
> table to find no present PTE and trigger a fault.

Yes, that is 100% correct. I believe that this is unavoidable.

> > That invalidation will only be seen by other TLBs after the
> > `__flush_tlb_range()`, so we should save a few faults because only
> > "local" threads will ever see the invalid entry, as opposed to all
> > threads that try to read our modified range?
> 
> So what you are saying is that deferring the flush means that, if there is
> already a TLB entry, we reduce the likelihood that a page table walk is
> triggered that could find no present PTE, and consequently reduce the
> likelihood that a page fault is triggered.
> 
> (I use the word likelihood, because I assume other action could result in a
> TLB entry getting flushed in the meantime, such as TLB entry reuse)
> 
> Correct?

Yes, and your language here is clearer than the original commit message
(and cover letter). Will amend it to be closer to your wording.
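
To spell the ordering out (roughly, ignoring the dirty/young accumulation
and the address alignment in contpte_convert()):

  Ordering before the series (and still used without BBML2_NOABORT):

	__ptep_get_and_clear()  x CONT_PTES   /* PTEs now invalid in memory   */
	__flush_tlb_range()                   /* every CPU drops its entries  */
	                                      /* window: any walk here faults */
	__set_ptes()                          /* new contiguous PTEs visible  */

  Ordering after patches #2 and #3, on BBML2_NOABORT systems:

	__ptep_get_and_clear()  x CONT_PTES   /* PTEs invalid in memory, but  */
	                                      /* remote CPUs may still hit    */
	                                      /* their existing TLB entries   */
	__set_ptes()                          /* new contiguous PTEs visible  */
	                                      /* no trailing flush            */

Only a CPU that misses in its TLB during the short clear->set window has to
walk the tables and can take a fault; CPUs that still hold the old entries
(which map the same memory, just without the contiguous hint) keep running,
and with BBML2_NOABORT the hardware can hold both the old and the new
entries without raising a conflict abort.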

-- 
Kind regards,
Mikołaj Lenczewski