[PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()

Posted by David Hildenbrand 2 months, 3 weeks ago
print_bad_pte() looks like something that should actually be a WARN
or similar, but historically it has apparently proven useful for
detecting page table corruption even on production systems -- report
the issue and keep the system running to make it easier to actually
figure out what is going wrong (e.g., multiple such messages might
shed some light).

As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
to take care of print_bad_pte() as well.

Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
implementation and renaming the function -- we'll rename it to what
we actually print: bad (page) mappings. Maybe it should be called
"print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
because the assumption is that we are dealing with some (previously)
present page table entry that got corrupted in weird ways.

Whether it is a PTE or something else will usually become obvious from the
page table dump or from the dumped stack. If ever required in the future,
we could pass the entry level type similar to "enum rmap_level". For now,
let's keep it simple.
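
(Purely as an illustration of that idea -- not part of this patch -- such
a level indicator might mirror "enum rmap_level" along the lines of:

        /* hypothetical sketch; the names are made up for illustration */
        enum pgtable_level {
                PGTABLE_LEVEL_PTE,
                PGTABLE_LEVEL_PMD,
                PGTABLE_LEVEL_PUD,
        };

with print_bad_page_map() taking such a value to label the dumped entry.)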

To make the function a bit more readable, factor out the ratelimit check
into is_bad_page_map_ratelimited() and place the dumping of page
table content into __dump_bad_page_map_pgtable(). We'll now dump
information from each level in a single line, and just stop the table
walk once we hit something that is not a present page table.

Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
for vm_normal_page(), now that we have a function that can handle it.

The report will now look something like (dumping pgd to pmd values):

[   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
[   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
[   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067

Not using pgdp_get(), because that does not work properly on some arm
configs where pgd_t is an array. Note that, for simplicity, we dump all
levels even when some of them are folded.
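
(For illustration only: on those arm configs, pgd_t wraps a small array,
roughly along the lines of

        /* illustrative sketch based on 32-bit arm's 2-level layout */
        typedef struct { pmdval_t pgd[2]; } pgd_t;

so the whole-entry read performed by the generic pgdp_get() does not do
the right thing there, while dereferencing the pointer and feeding it to
pgd_val() does.)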

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 94 insertions(+), 26 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 173eb6267e0ac..08d16ed7b4cc7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 			add_mm_counter(mm, i, rss[i]);
 }
 
-/*
- * This function is called to print an error when a bad pte
- * is found. For example, we might have a PFN-mapped pte in
- * a region that doesn't allow it.
- *
- * The calling function must still handle the error.
- */
-static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
-			  pte_t pte, struct page *page)
+static bool is_bad_page_map_ratelimited(void)
 {
-	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
-	p4d_t *p4d = p4d_offset(pgd, addr);
-	pud_t *pud = pud_offset(p4d, addr);
-	pmd_t *pmd = pmd_offset(pud, addr);
-	struct address_space *mapping;
-	pgoff_t index;
 	static unsigned long resume;
 	static unsigned long nr_shown;
 	static unsigned long nr_unshown;
@@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	if (nr_shown == 60) {
 		if (time_before(jiffies, resume)) {
 			nr_unshown++;
-			return;
+			return true;
 		}
 		if (nr_unshown) {
 			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
@@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	}
 	if (nr_shown++ == 0)
 		resume = jiffies + 60 * HZ;
+	return false;
+}
+
+static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long long pgdv, p4dv, pudv, pmdv;
+	p4d_t p4d, *p4dp;
+	pud_t pud, *pudp;
+	pmd_t pmd, *pmdp;
+	pgd_t *pgdp;
+
+	/*
+	 * This looks like a fully lockless walk, however, the caller is
+	 * expected to hold the leaf page table lock in addition to other
+	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
+	 * content while any concurrent modifications should be completely
+	 * prevented.
+	 */
+	pgdp = pgd_offset(mm, addr);
+	pgdv = pgd_val(*pgdp);
+
+	if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
+		pr_alert("pgd:%08llx\n", pgdv);
+		return;
+	}
+
+	p4dp = p4d_offset(pgdp, addr);
+	p4d = p4dp_get(p4dp);
+	p4dv = p4d_val(p4d);
+
+	if (!p4d_present(p4d) || p4d_leaf(p4d)) {
+		pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
+		return;
+	}
+
+	pudp = pud_offset(p4dp, addr);
+	pud = pudp_get(pudp);
+	pudv = pud_val(pud);
+
+	if (!pud_present(pud) || pud_leaf(pud)) {
+		pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
+		return;
+	}
+
+	pmdp = pmd_offset(pudp, addr);
+	pmd = pmdp_get(pmdp);
+	pmdv = pmd_val(pmd);
+
+	/*
+	 * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
+	 * because the table should already be mapped by the caller and
+	 * doing another map would be bad. print_bad_page_map() should
+	 * already take care of printing the PTE.
+	 */
+	pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
+		 p4dv, pudv, pmdv);
+}
+
+/*
+ * This function is called to print an error when a bad page table entry (e.g.,
+ * corrupted page table entry) is found. For example, we might have a
+ * PFN-mapped pte in a region that doesn't allow it.
+ *
+ * The calling function must still handle the error.
+ */
+static void print_bad_page_map(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long long entry, struct page *page)
+{
+	struct address_space *mapping;
+	pgoff_t index;
+
+	if (is_bad_page_map_ratelimited())
+		return;
 
 	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
 	index = linear_page_index(vma, addr);
 
-	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
-		 current->comm,
-		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
+	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
+	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
 	if (page)
-		dump_page(page, "bad pte");
+		dump_page(page, "bad page map");
 	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
 		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
 	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
@@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		if (is_zero_pfn(pfn))
 			return NULL;
 
-		print_bad_pte(vma, addr, pte, NULL);
+		print_bad_page_map(vma, addr, pte_val(pte), NULL);
 		return NULL;
 	}
 
@@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 
 check_pfn:
 	if (unlikely(pfn > highest_memmap_pfn)) {
-		print_bad_pte(vma, addr, pte, NULL);
+		print_bad_page_map(vma, addr, pte_val(pte), NULL);
 		return NULL;
 	}
 
@@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
-	if (unlikely(pmd_special(pmd)))
+	if (unlikely(pmd_special(pmd))) {
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+			return NULL;
+		if (is_huge_zero_pfn(pfn))
+			return NULL;
+
+		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
 		return NULL;
+	}
 
 	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
 		if (vma->vm_flags & VM_MIXEDMAP) {
@@ -674,8 +739,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	if (is_huge_zero_pfn(pfn))
 		return NULL;
-	if (unlikely(pfn > highest_memmap_pfn))
+	if (unlikely(pfn > highest_memmap_pfn)) {
+		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
 		return NULL;
+	}
 
 	/*
 	 * NOTE! We still have PageReserved() pages in the page tables.
@@ -1509,7 +1576,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 		folio_remove_rmap_ptes(folio, page, nr, vma);
 
 		if (unlikely(folio_mapcount(folio) < 0))
-			print_bad_pte(vma, addr, ptent, page);
+			print_bad_page_map(vma, addr, pte_val(ptent), page);
 	}
 	if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
 		*force_flush = true;
@@ -4507,7 +4574,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		} else if (is_pte_marker_entry(entry)) {
 			ret = handle_pte_marker(vmf);
 		} else {
-			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
+			print_bad_page_map(vma, vmf->address,
+					   pte_val(vmf->orig_pte), NULL);
 			ret = VM_FAULT_SIGBUS;
 		}
 		goto out;
-- 
2.50.1
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by Demi Marie Obenour 2 months, 2 weeks ago
On 7/17/25 07:52, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it has apparently proven useful for
> detecting page table corruption even on production systems -- report
> the issue and keep the system running to make it easier to actually
> figure out what is going wrong (e.g., multiple such messages might
> shed some light).
> 
> [...]

Should this still use a WARN?  If the admin sets panic-on-warn they
have asked for "crash if anything goes wrong" and so that is what
they should get.  Otherwise the system will still stay up.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by David Hildenbrand 2 months, 2 weeks ago
On 18.07.25 00:06, Demi Marie Obenour wrote:
> On 7/17/25 07:52, David Hildenbrand wrote:
>> [...]
> 
> Should this still use a WARN?  If the admin sets panic-on-warn they
> have asked for "crash if anything goes wrong" and so that is what
> they should get.  Otherwise the system will still stay up.

I assume your comment is in the context of the other proposal regarding
panicking.

It's a good question whether we should WARN: likely we should convert
the "BUG:" ... message into a WARN. With panic-on-warn you'd panic
immediately without being able to observe any other such messages (and
as discussed in the RFC, seeing multiple such reports can apparently be
valuable for debugging, because a single report is often insufficient).

But as panic-on-warn means "panic at the first sight of a problem", that
sounds right.

That change should not be part of this patch, though.
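
(For illustration only -- a hypothetical follow-up, not something this
patch proposes -- such a conversion could be as small as emitting the
header line via WARN() instead of pr_alert():

        /* hypothetical sketch, untested */
        WARN(1, "Bad page map in process %s  entry:%08llx\n",
             current->comm, entry);

so that both panic_on_warn and kernel.warn_limit would apply.)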

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by Demi Marie Obenour 2 months, 2 weeks ago
On 7/18/25 03:44, David Hildenbrand wrote:
> On 18.07.25 00:06, Demi Marie Obenour wrote:
>> On 7/17/25 07:52, David Hildenbrand wrote:
>>> [...]
>>
>> Should this still use a WARN?  If the admin sets panic-on-warn they
>> have asked for "crash if anything goes wrong" and so that is what
>> they should get.  Otherwise the system will still stay up.
> 
> I assume your comment is in the context of the other proposal regarding
> panicking.

Which one?  I'm only subscribed to xen-devel and might have missed it.

> It's a good question whether we should WARN: likely we should convert
> the "BUG:" ... message into a WARN. With panic-on-warn you'd panic
> immediately without being able to observe any other such messages (and
> as discussed in the RFC, seeing multiple such reports can apparently be
> valuable for debugging, because a single report is often insufficient).
> 
> But as panic-on-warn means "panic at the first sight of a problem", that
> sounds right.

There are other advantages to WARN() as well:

- Automated tools (like syzkaller) know how to deal with it (though they
  probably know how to deal with this too).

- It counts against kernel.warn_limit, so an administrator can choose
  to have the system panic if there are a huge number of warnings.
  (My completely uninformed guess is that "other such messages" would
  usually be less than 100, and that once one has gotten into the
  quadruple digits, the kernel is probably hopelessly confused.)

- It shows up in /sys/kernel/warn_count.

> That change should not be part of this patch, though.

Fair.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by David Hildenbrand 2 months, 2 weeks ago
On 18.07.25 09:59, Demi Marie Obenour wrote:
> On 7/18/25 03:44, David Hildenbrand wrote:
>> On 18.07.25 00:06, Demi Marie Obenour wrote:
>>> On 7/17/25 07:52, David Hildenbrand wrote:
>>>> print_bad_pte() looks like something that should actually be a WARN
>>>> or similar, but historically it apparently has proven to be useful to
>>>> detect corruption of page tables even on production systems -- report
>>>> the issue and keep the system running to make it easier to actually detect
>>>> what is going wrong (e.g., multiple such messages might shed a light).
>>>>
>>>> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
>>>> to take care of print_bad_pte() as well.
>>>>
>>>> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
>>>> implementation and renaming the function -- we'll rename it to what
>>>> we actually print: bad (page) mappings. Maybe it should be called
>>>> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
>>>> because the assumption is that we are dealing with some (previously)
>>>> present page table entry that got corrupted in weird ways.
>>>>
>>>> Whether it is a PTE or something else will usually become obvious from the
>>>> page table dump or from the dumped stack. If ever required in the future,
>>>> we could pass the entry level type similar to "enum rmap_level". For now,
>>>> let's keep it simple.
>>>>
>>>> To make the function a bit more readable, factor out the ratelimit check
>>>> into is_bad_page_map_ratelimited() and place the dumping of page
>>>> table content into __dump_bad_page_map_pgtable(). We'll now dump
>>>> information from each level in a single line, and just stop the table
>>>> walk once we hit something that is not a present page table.
>>>>
>>>> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
>>>> for vm_normal_page(), now that we have a function that can handle it.
>>>>
>>>> The report will now look something like (dumping pgd to pmd values):
>>>>
>>>> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
>>>> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
>>>> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>>>>
>>>> Not using pgdp_get(), because that does not work properly on some arm
>>>> configs where pgd_t is an array. Note that we are dumping all levels
>>>> even when levels are folded for simplicity.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> Should this still use a WARN?  If the admin sets panic-on-warn they
>>> have asked for "crash if anything goes wrong" and so that is what
>>> they should get.  Otherwise the system will still stay up.
>>
>> I assume your comment is in the context of the other proposal regarding
>> panicking.
> 
> Which one?  I'm only subscribed to xen-devel and might have missed it.

This one here:

https://lkml.kernel.org/r/4e1b7d2d-ed54-4e0a-a0a4-906b14d9cd41@p183

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by Lorenzo Stoakes 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 01:52:09PM +0200, David Hildenbrand wrote:
> [...]
>
> The report will now look something like (dumping pgd to pmd values):
>
> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>
> Not using pgdp_get(), because that does not work properly on some arm
> configs where pgd_t is an array. Note that, for simplicity, we dump all
> levels even when some of them are folded.

Oh god. I reviewed this below. BUT OH GOD. What. Why???

>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 94 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 173eb6267e0ac..08d16ed7b4cc7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  			add_mm_counter(mm, i, rss[i]);
>  }
>
> -/*
> - * This function is called to print an error when a bad pte
> - * is found. For example, we might have a PFN-mapped pte in
> - * a region that doesn't allow it.
> - *
> - * The calling function must still handle the error.
> - */
> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> -			  pte_t pte, struct page *page)
> +static bool is_bad_page_map_ratelimited(void)
>  {
> -	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> -	p4d_t *p4d = p4d_offset(pgd, addr);
> -	pud_t *pud = pud_offset(p4d, addr);
> -	pmd_t *pmd = pmd_offset(pud, addr);
> -	struct address_space *mapping;
> -	pgoff_t index;
>  	static unsigned long resume;
>  	static unsigned long nr_shown;
>  	static unsigned long nr_unshown;
> @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  	if (nr_shown == 60) {
>  		if (time_before(jiffies, resume)) {
>  			nr_unshown++;
> -			return;
> +			return true;
>  		}
>  		if (nr_unshown) {
>  			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	if (nr_shown++ == 0)
>  		resume = jiffies + 60 * HZ;
> +	return false;
> +}
> +
> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> +{
> +	unsigned long long pgdv, p4dv, pudv, pmdv;

> +	p4d_t p4d, *p4dp;
> +	pud_t pud, *pudp;
> +	pmd_t pmd, *pmdp;
> +	pgd_t *pgdp;
> +
> +	/*
> +	 * This looks like a fully lockless walk, however, the caller is
> +	 * expected to hold the leaf page table lock in addition to other
> +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
> +	 * content while any concurrent modifications should be completely
> +	 * prevented.
> +	 */

Hmmm :)

Why aren't we trying to lock at leaf level?

We need to:

- Keep VMA stable which prevents unmap page table teardown and khugepaged
  collapse.
- (not relevant as we don't traverse PTE table but) RCU lock for PTE
  entries to avoid MADV_DONTNEED page table withdrawal.

Buuut if we're not locking at leaf level, we leave ourselves open to racing
faults, zaps, etc. etc.

So perhaps this is why you require such strict conditions...

But can you truly be sure of these existing? And we should then assert them
here no? For rmap though we'd need the folio/vma.

> +	pgdp = pgd_offset(mm, addr);
> +	pgdv = pgd_val(*pgdp);

Before I went and looked again at the commit msg I said:

	"Shoudln't we strictly speaking use pgdp_get()? I see you use this
	 helper for other levels."

But obviously yeah. You explained the insane reason why not.

> +
> +	if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
> +		pr_alert("pgd:%08llx\n", pgdv);
> +		return;
> +	}
> +
> +	p4dp = p4d_offset(pgdp, addr);
> +	p4d = p4dp_get(p4dp);
> +	p4dv = p4d_val(p4d);
> +
> +	if (!p4d_present(p4d) || p4d_leaf(p4d)) {
> +		pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
> +		return;
> +	}
> +
> +	pudp = pud_offset(p4dp, addr);
> +	pud = pudp_get(pudp);
> +	pudv = pud_val(pud);
> +
> +	if (!pud_present(pud) || pud_leaf(pud)) {
> +		pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
> +		return;
> +	}
> +
> +	pmdp = pmd_offset(pudp, addr);
> +	pmd = pmdp_get(pmdp);
> +	pmdv = pmd_val(pmd);
> +
> +	/*
> +	 * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
> +	 * because the table should already be mapped by the caller and
> +	 * doing another map would be bad. print_bad_page_map() should
> +	 * already take care of printing the PTE.
> +	 */

I hate 32-bit kernels.

> +	pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
> +		 p4dv, pudv, pmdv);
> +}
> +
> +/*
> + * This function is called to print an error when a bad page table entry (e.g.,
> + * corrupted page table entry) is found. For example, we might have a
> + * PFN-mapped pte in a region that doesn't allow it.
> + *
> + * The calling function must still handle the error.
> + */

We have extremely strict locking conditions for the page table traversal... but
no mention of them here?

> +static void print_bad_page_map(struct vm_area_struct *vma,
> +		unsigned long addr, unsigned long long entry, struct page *page)
> +{
> +	struct address_space *mapping;
> +	pgoff_t index;
> +
> +	if (is_bad_page_map_ratelimited())
> +		return;
>
>  	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
>  	index = linear_page_index(vma, addr);
>
> -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> -		 current->comm,
> -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
> +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);

Sort of wonder if this is even useful if you don't know what the 'entry'
is? But I guess the dump below will tell you.

Though maybe actually useful to see flags etc. in case some horrid
corruption happened and maybe dump isn't valid? But then the dump assumes
strict conditions to work so... can that happen?

> +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
>  	if (page)
> -		dump_page(page, "bad pte");
> +		dump_page(page, "bad page map");
>  	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
>  		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>  	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  		if (is_zero_pfn(pfn))
>  			return NULL;
>
> -		print_bad_pte(vma, addr, pte, NULL);
> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
>  		return NULL;
>  	}
>
> @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>
>  check_pfn:
>  	if (unlikely(pfn > highest_memmap_pfn)) {
> -		print_bad_pte(vma, addr, pte, NULL);
> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);

This is unrelated to your series, but I guess this is for cases where
you're e.g. iomapping or such? So it's not something in the memmap but it's
a PFN that might reference io memory or such?

>  		return NULL;
>  	}
>
> @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  {
>  	unsigned long pfn = pmd_pfn(pmd);
>
> -	if (unlikely(pmd_special(pmd)))
> +	if (unlikely(pmd_special(pmd))) {
> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> +			return NULL;

I guess we'll bring this all together in a later patch with vm_normal_page()
as it's getting a little duplicative :P

Makes me think that VM_SPECIAL is kind of badly named (other than fact
'special' is nebulous and overloaded in general) in that it contains stuff
that is -VMA-special but only VM_PFNMAP | VM_MIXEDMAP really indicates
specialness wrt to underlying folio.

Then we have VM_IO, which strictly must not have an associated page right?
Which is the odd one out and I wonder if redundant somehow.

Anyway stuff to think about...

> +		if (is_huge_zero_pfn(pfn))
> +			return NULL;
> +
> +		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>  		return NULL;
> +	}
>
>  	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>  		if (vma->vm_flags & VM_MIXEDMAP) {
> @@ -674,8 +739,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>
>  	if (is_huge_zero_pfn(pfn))
>  		return NULL;
> -	if (unlikely(pfn > highest_memmap_pfn))
> +	if (unlikely(pfn > highest_memmap_pfn)) {
> +		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>  		return NULL;
> +	}
>
>  	/*
>  	 * NOTE! We still have PageReserved() pages in the page tables.
> @@ -1509,7 +1576,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>  		folio_remove_rmap_ptes(folio, page, nr, vma);
>
>  		if (unlikely(folio_mapcount(folio) < 0))
> -			print_bad_pte(vma, addr, ptent, page);
> +			print_bad_page_map(vma, addr, pte_val(ptent), page);
>  	}
>  	if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>  		*force_flush = true;
> @@ -4507,7 +4574,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		} else if (is_pte_marker_entry(entry)) {
>  			ret = handle_pte_marker(vmf);
>  		} else {
> -			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
> +			print_bad_page_map(vma, vmf->address,
> +					   pte_val(vmf->orig_pte), NULL);
>  			ret = VM_FAULT_SIGBUS;
>  		}
>  		goto out;
> --
> 2.50.1
>
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by David Hildenbrand 2 months, 3 weeks ago
>> The report will now look something like (dumping pgd to pmd values):
>>
>> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
>> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
>> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>>
>> Not using pgdp_get(), because that does not work properly on some arm
>> configs where pgd_t is an array. Note that, for simplicity, we dump all
>> levels even when some of them are folded.
> 
> Oh god. I reviewed this below. BUT OH GOD. What. Why???

Exactly.

I wish this patch wouldn't exist, but Hugh convinced me that apparently 
it can be useful in the real world.

> 
>>
>> [...]
>> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
>> +{
>> +	unsigned long long pgdv, p4dv, pudv, pmdv;
> 
>> +	p4d_t p4d, *p4dp;
>> +	pud_t pud, *pudp;
>> +	pmd_t pmd, *pmdp;
>> +	pgd_t *pgdp;
>> +
>> +	/*
>> +	 * This looks like a fully lockless walk, however, the caller is
>> +	 * expected to hold the leaf page table lock in addition to other
>> +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
>> +	 * content while any concurrent modifications should be completely
>> +	 * prevented.
>> +	 */
> 
> Hmmm :)
> 
> Why aren't we trying to lock at leaf level?

Ehm, did you read the:

"the caller is expected to hold the leaf page table lock"

:)


> 
> We need to:
> 
> - Keep VMA stable which prevents unmap page table teardown and khugepaged
>    collapse.
> - (not relevant as we don't traverse PTE table but) RCU lock for PTE
>    entries to avoid MADV_DONTNEED page table withdrawal.
> 
> Buuut if we're not locking at leaf level, we leave ourselves open to racing
> faults, zaps, etc. etc.

Yes. I can clarify in the comment of print_bad_page_map() that it is
expected to be called with the PTL (unwritten rule, not changing that).
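
For context, the typical caller pattern looks something like the
following (an illustrative sketch only, simplified from the zap/copy
style paths):

        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        ...
        /* PTL held: vm_normal_page() may end up in print_bad_page_map() */
        struct page *page = vm_normal_page(vma, addr, ptep_get(pte));
        ...
        pte_unmap_unlock(pte, ptl);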

> 
> So perhaps this is why you require such strict conditions...
> 
> But can you truly be sure of these existing? And we should then assert them
> here no? For rmap though we'd need the folio/vma.

I hope you realize that this nasty piece of code is called when our
system is already running into something extremely unexpected and will
probably be dead soon.

So I am not too interested in adding anything more here. If you run into
this code you're in big trouble already.

> 
>> +	pgdp = pgd_offset(mm, addr);
>> +	pgdv = pgd_val(*pgdp);
> 
> Before I went and looked again at the commit msg I said:
> 
> 	"Shouldn't we strictly speaking use pgdp_get()? I see you use this
> 	 helper for other levels."
> 
> But obviously yeah. You explained the insane reason why not.

Had to find out the hard way ... :)

[...]

>> +/*
>> + * This function is called to print an error when a bad page table entry (e.g.,
>> + * corrupted page table entry) is found. For example, we might have a
>> + * PFN-mapped pte in a region that doesn't allow it.
>> + *
>> + * The calling function must still handle the error.
>> + */
> 
> We have extremely strict locking conditions for the page table traversal... but
> no mention of them here?

Yeah, I can add that.

> 
>> +static void print_bad_page_map(struct vm_area_struct *vma,
>> +		unsigned long addr, unsigned long long entry, struct page *page)
>> +{
>> +	struct address_space *mapping;
>> +	pgoff_t index;
>> +
>> +	if (is_bad_page_map_ratelimited())
>> +		return;
>>
>>   	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
>>   	index = linear_page_index(vma, addr);
>>
>> -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
>> -		 current->comm,
>> -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
>> +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
> 
> Sort of wonder if this is even useful if you don't know what the 'entry'
> is? But I guess the dump below will tell you.

You probably missed this in the patch description:

"Whether it is a PTE or something else will usually become obvious from 
the page table dump or from the dumped stack. If ever required in the 
future, we could pass the entry level type similar to "enum rmap_level". 
For now, let's keep it simple."

> 
> Though maybe actually useful to see flags etc. in case some horrid
> corruption happened and maybe dump isn't valid? But then the dump assumes
> strict conditions to work so... can that happen?

Not sure what you mean, can you elaborate?

If your system is falling apart completely, I'm afraid this report here
will not save you.

You'll probably get a flood of 60 reports ... if your system doesn't
collapse before that.

> 
>> +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
>>   	if (page)
>> -		dump_page(page, "bad pte");
>> +		dump_page(page, "bad page map");
>>   	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
>>   		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>>   	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
>> @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>   		if (is_zero_pfn(pfn))
>>   			return NULL;
>>
>> -		print_bad_pte(vma, addr, pte, NULL);
>> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
>>   		return NULL;
>>   	}
>>
>> @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>
>>   check_pfn:
>>   	if (unlikely(pfn > highest_memmap_pfn)) {
>> -		print_bad_pte(vma, addr, pte, NULL);
>> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> 
> This is unrelated to your series, but I guess this is for cases where
> you're e.g. iomapping or such? So it's not something in the memmap but it's
> a PFN that might reference io memory or such?

No, it's just an easy check for catching corruptions. In a later patch I 
add a comment.

> 
>>   		return NULL;
>>   	}
>>
>> @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   {
>>   	unsigned long pfn = pmd_pfn(pmd);
>>
>> -	if (unlikely(pmd_special(pmd)))
>> +	if (unlikely(pmd_special(pmd))) {
>> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>> +			return NULL;
> 
> I guess we'll bring this all together in a later patch with vm_normal_page()
> as it's getting a little duplicative :P

Not that part, but the other part :)

> 
> Makes me think that VM_SPECIAL is kind of badly named (other than fact
> 'special' is nebulous and overloaded in general) in that it contains stuff
> that is -VMA-special but only VM_PFNMAP | VM_MIXEDMAP really indicates
> specialness wrt to underlying folio.

It is.

> 
> Then we have VM_IO, which strictly must not have an associated page right?

VM_IO just means read/write side-effects, I think you could have ones
with a memmap easily ... e.g., a memory section (128MiB) spanning both
memory and MMIO regions.

Thanks!

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by Lorenzo Stoakes 2 months, 2 weeks ago
On Thu, Jul 17, 2025 at 10:03:31PM +0200, David Hildenbrand wrote:
> > > The report will now look something like (dumping pgd to pmd values):
> > >
> > > [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> > > [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> > > [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> > >
> > > Not using pgdp_get(), because that does not work properly on some arm
> > > configs where pgd_t is an array. Note that, for simplicity, we dump all
> > > levels even when some of them are folded.
> >
> > Oh god. I reviewed this below. BUT OH GOD. What. Why???
>
> Exactly.
>
> I wish this patch wouldn't exist, but Hugh convinced me that apparently it
> can be useful in the real world.

Yeah... I mean out of scope for here but that sounds dubious. On the other hand,
we use typedef for page table values etc. etc. so we do make this possible.

>
> >
> > >
> > > [...]
> > > +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > +	unsigned long long pgdv, p4dv, pudv, pmdv;
> >
> > > +	p4d_t p4d, *p4dp;
> > > +	pud_t pud, *pudp;
> > > +	pmd_t pmd, *pmdp;
> > > +	pgd_t *pgdp;
> > > +
> > > +	/*
> > > +	 * This looks like a fully lockless walk, however, the caller is
> > > +	 * expected to hold the leaf page table lock in addition to other
> > > +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
> > > +	 * content while any concurrent modifications should be completely
> > > +	 * prevented.
> > > +	 */
> >
> > Hmmm :)
> >
> > Why aren't we trying to lock at leaf level?
>
> Ehm, did you read the:
>
> "the caller is expected to hold the leaf page table lock"
>
> :)

Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
but I guess the intent is that the caller _must_ hold this lock.

I know it's nitty and annoying (sorry!) but as asserting seems to not be a
possibility here, could we spell these out as a series of points like:

/*
 * The caller MUST hold the following locks:
 *
 * - Leaf page table lock
 * - Appropriate VMA lock to keep VMA stable
 */

I don't _actually_ think you need the rmap lock then, as none of the page tables
you access would be impacted by any rmap action afaict, with these locks held.
>
>
> >
> > We need to:
> >
> > - Keep VMA stable which prevents unmap page table teardown and khugepaged
> >    collapse.
> > - (not relevant as we don't traverse PTE table but) RCU lock for PTE
> >    entries to avoid MADV_DONTNEED page table withdrawal.
> >
> > Buuut if we're not locking at leaf level, we leave ourselves open to racing
> > faults, zaps, etc. etc.
>
> Yes. I can clarify in the comment of print_bad_page_map() that it is
> expected to be called with the PTL (unwritten rule, not changing that).

Right, see above, just needs more clarity then.

>
> >
> > So perhaps this is why you require such strict conditions...
> >
> > But can you truly be sure of these existing? And we should then assert them
> > here no? For rmap though we'd need the folio/vma.
>
> I hope you realize that this nasty piece of code is called when our
> system is already running into something extremely unexpected and will
> probably be dead soon.
>
> So I am not too interested in adding anything more here. If you run into this
> code you're in big trouble already.

Yes am aware :) my concern is NULL ptr deref or UAF, but with the locks
held as stated those won't occur.

But if it's not sensible to do it then we don't have to :) I am a reasonable
man, or like to think I am ;)

But I think we need clarity as per the above.

>
> >
> > > +	pgdp = pgd_offset(mm, addr);
> > > +	pgdv = pgd_val(*pgdp);
> >
> > Before I went and looked again at the commit msg I said:
> >
> > 	"Shouldn't we strictly speaking use pgdp_get()? I see you use this
> > 	 helper for other levels."
> >
> > But obviously yeah. You explained the insane reason why not.
>
> Had to find out the hard way ... :)

Pain.

>
> [...]
>
> > > +/*
> > > + * This function is called to print an error when a bad page table entry (e.g.,
> > > + * corrupted page table entry) is found. For example, we might have a
> > > + * PFN-mapped pte in a region that doesn't allow it.
> > > + *
> > > + * The calling function must still handle the error.
> > > + */
> >
> > We have extremely strict locking conditions for the page table traversal... but
> > no mention of them here?
>
> Yeah, I can add that.

Thanks!

>
> >
> > > +static void print_bad_page_map(struct vm_area_struct *vma,
> > > +		unsigned long addr, unsigned long long entry, struct page *page)
> > > +{
> > > +	struct address_space *mapping;
> > > +	pgoff_t index;
> > > +
> > > +	if (is_bad_page_map_ratelimited())
> > > +		return;
> > >
> > >   	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> > >   	index = linear_page_index(vma, addr);
> > >
> > > -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> > > -		 current->comm,
> > > -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
> > > +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
> >
> > Sort of wonder if this is even useful if you don't know what the 'entry'
> > is? But I guess the dump below will tell you.
>
> You probably missed this in the patch description:
>
> "Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future, we
> could pass the entry level type similar to "enum rmap_level". For now, let's
> keep it simple."

Yeah sorry I glossed over the commit msg, and now I pay for it ;) OK this
is fine then.

>
> >
> > Though maybe actually useful to see flags etc. in case some horrid
> > corruption happened and maybe dump isn't valid? But then the dump assumes
> > strict conditions to work so... can that happen?
>
> Not sure what you mean, can you elaborate?
>
> If your system is falling apart completely, I'm afraid this report here will
> not save you.
>
> You'll probably get a flood of 60 reports ... if your system doesn't collapse
> before that.

I was musing wistfully about the value of outputting the entries. But I
guess _any_ information before a crash is useful. But your dump output will
be a lot more useful :)

>
> >
> > > +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
> > >   	if (page)
> > > -		dump_page(page, "bad pte");
> > > +		dump_page(page, "bad page map");
> > >   	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> > >   		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> > >   	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> > > @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >   		if (is_zero_pfn(pfn))
> > >   			return NULL;
> > >
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> > >   		return NULL;
> > >   	}
> > >
> > > @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >
> > >   check_pfn:
> > >   	if (unlikely(pfn > highest_memmap_pfn)) {
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> >
> > This is unrelated to your series, but I guess this is for cases where
> > you're e.g. iomapping or such? So it's not something in the memmap but it's
> > a PFN that might reference io memory or such?
>
> No, it's just an easy check for catching corruptions. In a later patch I add
> a comment.

Ohhh right. I was overthinking this haha

>
> >
> > >   		return NULL;
> > >   	}
> > >
> > > @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >   {
> > >   	unsigned long pfn = pmd_pfn(pmd);
> > >
> > > -	if (unlikely(pmd_special(pmd)))
> > > +	if (unlikely(pmd_special(pmd))) {
> > > +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > > +			return NULL;
> >
> > I guess we'll bring this all together in a later patch with vm_normal_page()
> > as it's getting a little duplicative :P
>
> Not that part, but the other part :)

Yes :)

>
> >
> > Makes me think that VM_SPECIAL is kind of badly named (other than fact
> > 'special' is nebulous and overloaded in general) in that it contains stuff
> > that is -VMA-special but only VM_PFNMAP | VM_MIXEDMAP really indicates
> > specialness wrt to underlying folio.
>
> It is.

Yeah I think we in mm periodically moan about this whole thing...

>
> >
> > Then we have VM_IO, which strictly must not have an associated page right?
>
> VM_IO just means read/write side-effects, I think you could have ones with
> a memmap easily ... e.g., a memory section (128MiB) spanning both memory and
> MMIO regions.

Hmm, but why not have two separate VMAs? I guess I need to look more into
what this flag actually affects.

But in terms of VMA manipulation in any case it really is no different from
e.g. VM_PFNMAP in terms of what it affects. Though well. I say that. Except
of course this whole vm_normal_page() thing...!

But I've not seen VM_IO set alone anywhere. On the other hand, I haven't
really dug deep on this...

>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by David Hildenbrand 2 months, 2 weeks ago
> 
> Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
> but I guess the intent is that the caller _must_ hold this lock.
> 
> I know it's nitty and annoying (sorry!) but as asserting seems to not be a
> possibility here, could we spell these out as a series of points like:
> 
> /*
>   * The caller MUST hold the following locks:
>   *
>   * - Leaf page table lock
>   * - Appropriate VMA lock to keep VMA stable
>   */
> 
> I don't _actually_ think you need the rmap lock then, as none of the page tables
> you access would be impacted by any rmap action afaict, with these locks held.

I don't enjoy wrong comments ;)

This can be called from rmap code when doing a vm_normal_page() while 
holding the PTL.

Really, I think we are over-thinking a helper that is triggered in 
specific context when the world is about to collide.

This is not your general-purpose API.

Maybe I should have never added a comment. Maybe I should just not have 
done this patch, because I really don't want to do more than the bare 
minimum to print_bad_page_map().

Because I deeply detest it, and no comments we will add will change that.

[...]

>>>> -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
>>>> -		 current->comm,
>>>> -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
>>>> +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
>>>
>>> Sort of wonder if this is even useful if you don't know what the 'entry'
>>> is? But I guess the dump below will tell you.
>>
>> You probably missed this in the patch description:
>>
>> "Whether it is a PTE or something else will usually become obvious from the
>> page table dump or from the dumped stack. If ever required in the future, we
>> could pass the entry level type similar to "enum rmap_level". For now, let's
>> keep it simple."
> 
> Yeah sorry I glossed over the commit msg, and now I pay for it ;) OK this
> is fine then.

Let me play with indicating the page table level, but it's the kind of 
stuff I wouldn't want to do in this series here.

>>
>>>
>>> Then we have VM_IO, which strictly must not have an associated page right?
>>
>> VM_IO just means read/write side-effects, I think you could have ones with
>> an memmap easily ... e.g., memory section (128MiB) spanning both memory and
>> MMIO regions.
> 
> Hmm, but why not have two separate VMAs? I guess I need to look into more
> what this flag actually effects.

Oh, I meant that we might have a "struct page" for MMIO memory 
(pfn_valid() == true).

In a MIXEDMAP that will get refcounted. Not sure if there are users that 
use VM_IO in a MIXEDMAP, I would assume so but didn't check.

So VM_IO doesn't really interact with vm_normal_page(), really. It's all 
about PFNMAP and MIXEDMAP.
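
(Roughly, from memory of mm/memory.c, the MIXEDMAP branch of
vm_normal_page() keys off pfn_valid() rather than VM_IO:

        if (vma->vm_flags & VM_MIXEDMAP) {
                if (!pfn_valid(pfn))
                        return NULL;    /* raw PFN without a memmap */
                ...
                goto out;               /* has a memmap -> refcounted page */
        }

so MMIO that happens to have a memmap gets treated like a normal page.)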

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Posted by Lorenzo Stoakes 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 01:04:30PM +0200, David Hildenbrand wrote:
> >
> > Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
> > but I guess the intent is that the caller _must_ hold this lock.
> >
> > I know it's nitty and annoying (sorry!) but as asserting seems to not be a
> > possibility here, could we spell these out as a series of points like:
> >
> > /*
> >   * The caller MUST hold the following locks:
> >   *
> >   * - Leaf page table lock
> >   * - Appropriate VMA lock to keep VMA stable
> >   */
> >
> > I don't _actually_ think you need the rmap lock then, as none of the page tables
> > you access would be impacted by any rmap action afaict, with these locks held.
>
> I don't enjoy wrong comments ;)
>
> This can be called from rmap code when doing a vm_normal_page() while
> holding the PTL.

OK so this is just underlining the confusion here (not your fault! I mean in
general with page table code, really).

Would this possibly work better then?:

/*
 * The caller MUST prevent page table teardown (by holding mmap, vma or rmap lock)
 * and MUST hold the leaf page table lock.
 */

>
> Really, I think we are over-thinking a helper that is triggered in specific
> context when the world is about to collide.

Indeed, but I think it's important to be clear as to what is required for this
to work (even though, as I understand it, we're already in trouble and if page
tables are corrupted or something like this we may even NULL ptr deref anyway so
it's best effort).

>
> This is not your general-purpose API.
>
> Maybe I should have never added a comment. Maybe I should just not have done
> this patch, because I really don't want to do more than the bare minimum to
> print_bad_page_map().

No no David no :P come back to the light sir...

This is a good patch I don't mean to dissuade you, I just want to make things
clear in a confusing bit of the kernel as we in mm as usual make life hard for
ourselves...

I think the locking comment _is_ important, as for far too long in mm we've had
_implicit_ understanding of where the locks should be at any given time, which
constantly blows up underneath us.

I'd rather keep things as clear as possible so even the intellectually
challenged such as myself are subject to less confusion.

Anyway TL;DR if you do something similar to the suggestion above it's all good. Just
needs clarification.

>
> Because I deeply detest it, and no comments we will add will change that.

So do I! *clinks glass* but yes, it's horrid. But there's value in improving
quality of horrid code and refactoring to the least-worst version.

And we can attack it in a more serious way later.

[snip]

> Let me play with indicating the page table level, but it's the kind of stuff
> I wouldn't want to do in this series here.

Sure understood. I don't mean to hold this up with nits. Am happy to be
flexible, just thinking out loud generally.

>
> > >
> > > >
> > > > Then we have VM_IO, which strictly must not have an associated page right?
> > >
> > > VM_IO just means read/write side-effects, I think you could have ones with
> > > a memmap easily ... e.g., a memory section (128MiB) spanning both memory and
> > > MMIO regions.
> >
> > Hmm, but why not have two separate VMAs? I guess I need to look more into
> > what this flag actually affects.
>
> Oh, I meant that we might have a "struct page" for MMIO memory (pfn_valid()
> == true).
>
> In a MIXEDMAP that will get refcounted. Not sure if there are users that use
> VM_IO in a MIXEDMAP, I would assume so but didn't check.
>
> So VM_IO doesn't really interact with vm_normal_page(), really. It's all
> about PFNMAP and MIXEDMAP.

Thanks, yeah am thinking more broadly about VM_SPECIAL here. May go off and do
some work on that... VM_UNMERGEABLE might be better.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo