[PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd

Huang Ying posted 2 patches 2 months, 1 week ago
There is a newer version of this series
[PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd
Posted by Huang Ying 2 months, 1 week ago
In the current kernel, there is spurious fault fixing support for pte,
but not for huge pmd because no architectures need it. But in the
next patch in the series, we will change the write protection fault
handling logic on arm64, so that some stale huge pmd entries may
remain in the TLB. These entries need to be flushed via the huge pmd
spurious fault fixing mechanism.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/pgtable.h |  4 ++++
 mm/huge_memory.c        | 22 +++++++++++++++++-----
 mm/internal.h           |  4 ++--
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 32e8457ad535..341622ec80e4 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
 #endif
 
+#ifndef flush_tlb_fix_spurious_fault_pmd
+#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0)
+#endif
+
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..8533457c52b7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
 EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
-	       pmd_t *pmd, bool write)
+/* Returns whether the PMD entry is changed */
+int touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+	      pmd_t *pmd, bool write)
 {
+	int changed;
 	pmd_t _pmd;
 
 	_pmd = pmd_mkyoung(*pmd);
 	if (write)
 		_pmd = pmd_mkdirty(_pmd);
-	if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
-				  pmd, _pmd, write))
+	changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
+					pmd, _pmd, write);
+	if (changed)
 		update_mmu_cache_pmd(vma, addr, pmd);
+
+	return changed;
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf)
 	if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd)))
 		goto unlock;
 
-	touch_pmd(vmf->vma, vmf->address, vmf->pmd, write);
+	if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) {
+		/* See corresponding comments in handle_pte_fault(). */
+		if (vmf->flags & FAULT_FLAG_TRIED)
+			goto unlock;
+		if (vmf->flags & FAULT_FLAG_WRITE)
+			flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address,
+							 vmf->pmd);
+	}
 
 unlock:
 	spin_unlock(vmf->ptl);
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..8b58ab00a7cd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs,
  */
 void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 	       pud_t *pud, bool write);
-void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
-	       pmd_t *pmd, bool write);
+int touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+	      pmd_t *pmd, bool write);
 
 /*
  * Parses a string with mem suffixes into its order. Useful to parse kernel
-- 
2.39.5
Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd
Posted by Lorenzo Stoakes 2 months ago
On Mon, Oct 13, 2025 at 05:20:37PM +0800, Huang Ying wrote:
> In the current kernel, there is spurious fault fixing support for pte,
> but not for huge pmd because no architectures need it. But in the
> next patch in the series, we will change the write protection fault
> handling logic on arm64, so that some stale huge pmd entries may
> remain in the TLB. These entries need to be flushed via the huge pmd
> spurious fault fixing mechanism.
>
> Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>

Right now the PTE level spurious fault handling is dealt with in
handle_pte_fault() when ptep_set_access_flags() returns false.

Now you're updating touch_pmd() which is invoked by follow_huge_pmd() and
huge_pmd_set_accessed().

1 - Why are you not adding handling to GUP?

2 - Is this the correct level of abstraction? It's really not obvious but
    huge_pmd_set_accessed() is invoked by __handle_mm_fault() on a non-WP,
    non-NUMA hint huge page fault where a page table entry already exists
    but we are faulting anyway (e.g. non-present or read-only writable).

You don't mention any of this in the commit message, which you need to do
and really need to explain how spurious faults can arise, why you can only
do this at the point of abstraction you do (if you are unable to put it in
actual fault handing-code), and you need to add a bunch more comments to
explain this.

Otherwise this just ends up being a lot of open-coded + confusing 'you have
to go look it up/just know' type stuff that we have too much of in mm :)

So please update commit message/comments, confirm whether this is the
correct level of abstraction, and address other comments below, thanks!

> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Cc: "Christoph Lameter (Ampere)" <cl@gentwo.org>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Yicong Yang <yangyicong@hisilicon.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Kevin Brodsky <kevin.brodsky@arm.com>
> Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/pgtable.h |  4 ++++
>  mm/huge_memory.c        | 22 +++++++++++++++++-----
>  mm/internal.h           |  4 ++--
>  3 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 32e8457ad535..341622ec80e4 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1232,6 +1232,10 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>  #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
>  #endif
>
> +#ifndef flush_tlb_fix_spurious_fault_pmd
> +#define flush_tlb_fix_spurious_fault_pmd(vma, address, ptep) do { } while (0)
> +#endif

flush_tlb_fix_spurious_fault(), when the arch doesn't declare it, defaults to
flush_tlb_page() - why do we just do nothing in this case here?

> +
>  /*
>   * When walking page tables, get the address of the next boundary,
>   * or the end address of the range if that comes earlier.  Although no
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4225..8533457c52b7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1641,17 +1641,22 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>  EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
>  #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> -	       pmd_t *pmd, bool write)
> +/* Returns whether the PMD entry is changed */

Could we have a kernel doc description here?

> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr,

It's 2025 can we use bool please :)

> +	      pmd_t *pmd, bool write)
>  {
> +	int changed;
>  	pmd_t _pmd;

While we're here can we rename this horrible parameter name to e.g. entry? We're
significantly altering this function anyway so it isn't much more

>
>  	_pmd = pmd_mkyoung(*pmd);
>  	if (write)
>  		_pmd = pmd_mkdirty(_pmd);
> -	if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
> -				  pmd, _pmd, write))
> +	changed = pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
> +					pmd, _pmd, write);
> +	if (changed)
>  		update_mmu_cache_pmd(vma, addr, pmd);

We can make this simpler, e.g.:

	if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
				  pmd, entry, write)) {
		update_mmu_cache_pmd(vma, addr, pmd);
		return true;
	}

	return false;

> +
> +	return changed;
>  }
>
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -1849,7 +1854,14 @@ void huge_pmd_set_accessed(struct vm_fault *vmf)
>  	if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd)))
>  		goto unlock;
>
> -	touch_pmd(vmf->vma, vmf->address, vmf->pmd, write);
> +	if (!touch_pmd(vmf->vma, vmf->address, vmf->pmd, write)) {
> +		/* See corresponding comments in handle_pte_fault(). */

What are the 'corresponding' comments? How can a reader of this code know what
they are? This isn't a very helpful comment. Also those comments might be
moved in future...

Presumably it's:

		/* Skip spurious TLB flush for retried page fault */
		if (vmf->flags & FAULT_FLAG_TRIED)
			goto unlock;
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (vmf->flags & FAULT_FLAG_WRITE)
			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
						     vmf->pte);


So I don't see why it's so egregious to have the equivalent here, or actually
ideally to abstract the code entirely.

In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum
pgtable_level"") David introduced:

	enum pgtable_level {
		PGTABLE_LEVEL_PTE = 0,
		PGTABLE_LEVEL_PMD,
		PGTABLE_LEVEL_PUD,
		PGTABLE_LEVEL_P4D,
		PGTABLE_LEVEL_PGD,
	};

Which allows for sensible abstraction.

> +		if (vmf->flags & FAULT_FLAG_TRIED)
> +			goto unlock;
> +		if (vmf->flags & FAULT_FLAG_WRITE)
> +			flush_tlb_fix_spurious_fault_pmd(vmf->vma, vmf->address,
> +							 vmf->pmd);
> +	}
>
>  unlock:
>  	spin_unlock(vmf->ptl);
> diff --git a/mm/internal.h b/mm/internal.h
> index 1561fc2ff5b8..8b58ab00a7cd 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1402,8 +1402,8 @@ int __must_check try_grab_folio(struct folio *folio, int refs,
>   */
>  void touch_pud(struct vm_area_struct *vma, unsigned long addr,
>  	       pud_t *pud, bool write);
> -void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> -	       pmd_t *pmd, bool write);
> +int touch_pmd(struct vm_area_struct *vma, unsigned long addr,
> +	      pmd_t *pmd, bool write);
>
>  /*
>   * Parses a string with mem suffixes into its order. Useful to parse kernel
> --
> 2.39.5
>
Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd
Posted by David Hildenbrand 2 months ago
> 
> 		/* Skip spurious TLB flush for retried page fault */
> 		if (vmf->flags & FAULT_FLAG_TRIED)
> 			goto unlock;
> 		/*
> 		 * This is needed only for protection faults but the arch code
> 		 * is not yet telling us if this is a protection fault or not.
> 		 * This still avoids useless tlb flushes for .text page faults
> 		 * with threads.
> 		 */
> 		if (vmf->flags & FAULT_FLAG_WRITE)
> 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
> 						     vmf->pte);
> 
> 
> So I don't see why it's so egregious to have the equivalent here, or actually
> ideally to abstract the code entirely.

Let's definitely not duplicate such comments whereby one instance will 
end up bitrotting.

When talking about spurious faults I assume the educated reader will 
usually find the right comments -- like you easily did :P

However I agree that ...

> 
> In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum
> pgtable_level"") David introduced:
> 
> 	enum pgtable_level {
> 		PGTABLE_LEVEL_PTE = 0,
> 		PGTABLE_LEVEL_PMD,
> 		PGTABLE_LEVEL_PUD,
> 		PGTABLE_LEVEL_P4D,
> 		PGTABLE_LEVEL_PGD,
> 	};
> 
> Which allows for sensible abstraction.

... if there is an easier way to just unify the code and have the 
comments at a central place, even better.

-- 
Cheers

David / dhildenb
Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd
Posted by Lorenzo Stoakes 2 months ago
On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote:
>
> >
> > 		/* Skip spurious TLB flush for retried page fault */
> > 		if (vmf->flags & FAULT_FLAG_TRIED)
> > 			goto unlock;
> > 		/*
> > 		 * This is needed only for protection faults but the arch code
> > 		 * is not yet telling us if this is a protection fault or not.
> > 		 * This still avoids useless tlb flushes for .text page faults
> > 		 * with threads.
> > 		 */
> > 		if (vmf->flags & FAULT_FLAG_WRITE)
> > 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
> > 						     vmf->pte);
> >
> >
> > So I don't see why it's so egregious to have the equivalent here, or actually
> > ideally to abstract the code entirely.
>
> Let's definitely not duplicate such comments whereby one instance will end
> up bitrotting.

We're duplicating the code in two places, how would that bitrot happen exactly?

I mean as I also say, probably better to just de-duplicate in general.

It's one thing referring to a comment _above_ another function, it's quite
another to refer to comments buried in a branch buried inside another function
and to hand wave about it.

And _those_ comments might very well be moved when the function is sensibly
refactored (as it needs to be tbh).

So yeah, in general I'd agree _if_ you were doing something
similar-but-not-the-same AND referencing a clearly identifiable comment
(e.g. above the function).

But this isn't that at all.

Anyway TL;DR is that we should just de-dupe the code.

>
> When talking about spurious faults I assume the educated reader will usually
> find the right comments -- like you easily did :P

Yeah but I'm familiar with the (kinda hideous) code there, it wasn't so much
'easily' found and for somebody else they'd maybe get confused as to what on
earth that's referring to.

>
> However I agree that ...
>
> >
> > In commit b22cc9a9c7ff ("mm/rmap: convert "enum rmap_level" to "enum
> > pgtable_level"") David introduced:
> >
> > 	enum pgtable_level {
> > 		PGTABLE_LEVEL_PTE = 0,
> > 		PGTABLE_LEVEL_PMD,
> > 		PGTABLE_LEVEL_PUD,
> > 		PGTABLE_LEVEL_P4D,
> > 		PGTABLE_LEVEL_PGD,
> > 	};
> >
> > Which allows for sensible abstraction.
>
> ... if there is an easier way to just unify the code and have the comments
> at a central place, even better.

Yup we're agreed on that :)

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo
Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd
Posted by David Hildenbrand 2 months ago
On 14.10.25 16:49, Lorenzo Stoakes wrote:
> On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote:
>>
>>>
>>> 		/* Skip spurious TLB flush for retried page fault */
>>> 		if (vmf->flags & FAULT_FLAG_TRIED)
>>> 			goto unlock;
>>> 		/*
>>> 		 * This is needed only for protection faults but the arch code
>>> 		 * is not yet telling us if this is a protection fault or not.
>>> 		 * This still avoids useless tlb flushes for .text page faults
>>> 		 * with threads.
>>> 		 */
>>> 		if (vmf->flags & FAULT_FLAG_WRITE)
>>> 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
>>> 						     vmf->pte);
>>>
>>>
>>> So I don't see why it's so egregious to have the equivalent here, or actually
>>> ideally to abstract the code entirely.
>>
>> Let's definitely not duplicate such comments whereby one instance will end
>> up bitrotting.
> 
> We're duplicating the code in two places, how would that bitrot happen exactly?

Often we adjust/fix comments to make scenarios/conditions clearer or 
extend them to cover some new conditions.

So even without any code changes people will just ignore to update other 
comments.

Code you can at least test with the hope to find inconsistencies.

So copying rather large comments is usually never the answer :)

Well, just like copying larger chunks of code, agreed.

-- 
Cheers

David / dhildenb
Re: [PATCH -v2 1/2] mm: add spurious fault fixing support for huge pmd
Posted by Lorenzo Stoakes 2 months ago
On Tue, Oct 14, 2025 at 04:58:44PM +0200, David Hildenbrand wrote:
> On 14.10.25 16:49, Lorenzo Stoakes wrote:
> > On Tue, Oct 14, 2025 at 04:38:05PM +0200, David Hildenbrand wrote:
> > >
> > > >
> > > > 		/* Skip spurious TLB flush for retried page fault */
> > > > 		if (vmf->flags & FAULT_FLAG_TRIED)
> > > > 			goto unlock;
> > > > 		/*
> > > > 		 * This is needed only for protection faults but the arch code
> > > > 		 * is not yet telling us if this is a protection fault or not.
> > > > 		 * This still avoids useless tlb flushes for .text page faults
> > > > 		 * with threads.
> > > > 		 */
> > > > 		if (vmf->flags & FAULT_FLAG_WRITE)
> > > > 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
> > > > 						     vmf->pte);
> > > >
> > > >
> > > > So I don't see why it's so egregious to have the equivalent here, or actually
> > > > ideally to abstract the code entirely.
> > >
> > > Let's definitely not duplicate such comments whereby one instance will end
> > > up bitrotting.
> >
> > We're duplicating the code in two places, how would that bitrot happen exactly?
>
> Often we adjust/fix comments to make scenarios/conditions clearer or extend
> them to cover some new conditions.
>
> So even without any code changes people will just ignore to update other
> comments.
>
> Code you can at least test with the hope to find inconsistencies.
>
> So copying rather large comments is usually never the answer :)
>
> Well, just like copying larger chunks of code, agreed.

This is a bit moot I don't think it's worth having a big debate about.

I'm one of the biggest proponents of de-duplicating things (comments
included) and have done so extensively as you know.

My _only_ point here is that it's hard to find the comment referenced and
it's _very_ likely it'll get moved later (in fact I feel like doing that
_right now_ as that function needs refactoring).

In that case the lesser evil is to copy a 4 line comment right?

But anyway, we both agree de-duplicating the code as a whole is the right
way forward and that solves the issue so let's go with that!

>
> --
> Cheers
>
> David / dhildenb
>
>

Cheers, Lorenzo