mm/debug_vm_pgtable: clear page table entries at destroy_args()

[PATCH] mm/debug_vm_pgtable: clear page table entries at destroy_args()

Posted by Herton R. Krzesinski 6 months, 1 week ago

The mm/debug_vm_pgtable test manually allocates page table entries for the
tests it runs, along with its own manually allocated mm_struct. That in
itself is fine, but when the test exits, destroy_args() fails to clear those
entries with the *_clear functions.

The problem is that this leaves stale entries behind. If another process
allocates an mm_struct with a pgd at the same address, it may run into
the stale entry. This happens in practice on a debug kernel with
CONFIG_DEBUG_VM_PGTABLE=y. For example, this is the output with some
extra debugging I added (it prints a warning trace if pgtables_bytes goes
negative, in addition to the warning in the check_mm() function):

[    2.539353] debug_vm_pgtable: [get_random_vaddr         ]: random_vaddr is 0x7ea247140000
[    2.539366] kmem_cache info
[    2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
[    2.539447] debug_vm_pgtable: [init_args                ]: args->mm is 0x000000002267cc9e
(...)
[    2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
[    2.552816] Modules linked in:
[    2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
[    2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
[    2.552872] NIP:  c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
[    2.552885] REGS: c0000000622e73b0 TRAP: 0700   Not tainted  (6.12.0-105.debug_vm2.el10.ppc64le+debug)
[    2.552899] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822  XER: 0000000a
[    2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
[    2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
[    2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
[    2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
[    2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
[    2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
[    2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
[    2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
[    2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
[    2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
[    2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
[    2.553199] Call Trace:
[    2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
[    2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
[    2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
[    2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
[    2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
[    2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
[    2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
[    2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
[    2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
[    2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
[    2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
(...)
[    2.558892] ---[ end trace 0000000000000000 ]---
[    2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
[    2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144

Here the modprobe process ended up with an mm_struct from the mm_struct
slab that was previously used by the debug_vm_pgtable test. That is not a
problem in itself, since the mm_struct is initialized again. However, if it
also ends up using the same pgd table, it bumps into the old stale entry
when clearing/freeing the page table entries, so it tries to free an entry
that is already gone (the one allocated by the debug_vm_pgtable test). This
also explains the negative pgtables_bytes, since the accounting covers
entries that were never allocated in the current process. As far as I can
tell, pgd_{alloc,free} etc. do not clear entries; clearing is done
explicitly in the free_pgtables->free_pgd_range->free_p4d_range->
free_pud_range->free_pmd_range->free_pte_range path. However, the
debug_vm_pgtable test does not call free_pgtables, since it allocates the
mm_struct and entries manually for its tests and does not, for example, go
through page faults. So it should also clear the entries manually before
exiting, in destroy_args().

This problem was noticed during a test that reboots a powerpc host a number
of times, running a debug kernel with CONFIG_DEBUG_VM_PGTABLE enabled. It
depends on the system, but in a 100-iteration reboot loop the problem could
manifest once or twice, if a process ends up getting the right mm->pgd
entry with the stale entries used by mm/debug_vm_pgtable. After applying
this patch, I could no longer reproduce the problem. I was also able to
reproduce the problem on the latest upstream kernel (6.16).

I also modified destroy_args() to use mmput() instead of mmdrop(). There
is no reason to hold the mm_users reference and not release the mm_struct
entirely. In the output above, with my debugging prints, I had already
patched it to use mmput(); that did not fix the problem, but it helped
with debugging.
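
For illustration only, here is a userspace sketch of the two counters as I
understand them: all mm_users references together hold one mm_count
reference, mmput() runs the address-space teardown when mm_users reaches
zero and then drops that mm_count reference, and mmdrop() only touches
mm_count. The toy_* names and the torn_down/freed flags are hypothetical;
the real functions use atomics and do considerably more.

```c
#include <assert.h>

/* Toy userspace model, not kernel code. All names here are hypothetical. */
struct toy_mm {
	int mm_users;	/* users of the address space */
	int mm_count;	/* references to the structure itself */
	int torn_down;	/* set when the address-space teardown step ran */
	int freed;	/* set when the structure would be freed */
};

/* A freshly allocated mm starts with one user and one structure reference. */
static void toy_mm_init(struct toy_mm *mm)
{
	mm->mm_users = 1;
	mm->mm_count = 1;
	mm->torn_down = 0;
	mm->freed = 0;
}

/* Drop a structure reference; free the structure on the last one. */
static void toy_mmdrop(struct toy_mm *mm)
{
	assert(mm->mm_count > 0);
	if (--mm->mm_count == 0)
		mm->freed = 1;
}

/* Drop a user reference; the last user tears down, then drops mm_count. */
static void toy_mmput(struct toy_mm *mm)
{
	assert(mm->mm_users > 0);
	if (--mm->mm_users == 0) {
		mm->torn_down = 1;
		toy_mmdrop(mm);
	}
}
```

In this model, calling toy_mmdrop() on a fresh mm frees it while mm_users
is still 1 and the teardown step never runs, whereas toy_mmput() brings
both counters to zero.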

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
---
 mm/debug_vm_pgtable.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 7731b238b534..0f5ddefd128a 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args)
 
 	/* Free page table entries */
 	if (args->start_ptep) {
+		pmd_clear(args->pmdp);
 		pte_free(args->mm, args->start_ptep);
 		mm_dec_nr_ptes(args->mm);
 	}
 
 	if (args->start_pmdp) {
+		pud_clear(args->pudp);
 		pmd_free(args->mm, args->start_pmdp);
 		mm_dec_nr_pmds(args->mm);
 	}
 
 	if (args->start_pudp) {
+		p4d_clear(args->p4dp);
 		pud_free(args->mm, args->start_pudp);
 		mm_dec_nr_puds(args->mm);
 	}
 
-	if (args->start_p4dp)
+	if (args->start_p4dp) {
+		pgd_clear(args->pgdp);
 		p4d_free(args->mm, args->start_p4dp);
+	}
 
 	/* Free vma and mm struct */
 	if (args->vma)
 		vm_area_free(args->vma);
 
 	if (args->mm)
-		mmdrop(args->mm);
+		mmput(args->mm);
 }
 
 static struct page * __init
-- 
2.47.1

Re: [PATCH] mm/debug_vm_pgtable: clear page table entries at destroy_args()

Posted by Anshuman Khandual 6 months, 1 week ago

Hello Herton,

On 01/08/25 3:10 AM, Herton R. Krzesinski wrote:
> The mm/debug_vm_pagetable test allocates manually page table entries for the
> tests it runs, using also its manually allocated mm_struct. That in itself is
> ok, but when it exits, at destroy_args() it fails to clear those entries with
> the *_clear functions.
> 
> The problem is that leaves stale entries. If another process allocates
> an mm_struct with a pgd at the same address, it may end up running into
> the stale entry. This is happening in practice on a debug kernel with

Shouldn't the allocators ensure that the allocated memory elements are
all cleaned up before they are used?

> CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some
> extra debugging I added (it prints a warning trace if pgtables_bytes goes
> negative, in addition to the warning at check_mm() function):
> 
> [    2.539353] debug_vm_pgtable: [get_random_vaddr         ]: random_vaddr is 0x7ea247140000
> [    2.539366] kmem_cache info
> [    2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
> [    2.539447] debug_vm_pgtable: [init_args                ]: args->mm is 0x000000002267cc9e
> (...)
> [    2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
> [    2.552816] Modules linked in:
> [    2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
> [    2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
> [    2.552872] NIP:  c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
> [    2.552885] REGS: c0000000622e73b0 TRAP: 0700   Not tainted  (6.12.0-105.debug_vm2.el10.ppc64le+debug)
> [    2.552899] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822  XER: 0000000a
> [    2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
> [    2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
> [    2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
> [    2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
> [    2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
> [    2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
> [    2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
> [    2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
> [    2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
> [    2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
> [    2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
> [    2.553199] Call Trace:
> [    2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
> [    2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
> [    2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
> [    2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
> [    2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
> [    2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
> [    2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
> [    2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
> [    2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
> [    2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
> [    2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> (...)
> [    2.558892] ---[ end trace 0000000000000000 ]---
> [    2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
> [    2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144
> 
> Here the modprobe process ended up with an allocated mm_struct from the
> mm_struct slab that was used before by the debug_vm_pgtable test. That is not a
> problem, since the mm_struct is initialized again etc., however, if it ends up
> using the same pgd table, it bumps into the old stale entry when clearing/freeing
> the page table entries, so it tries to free an entry already gone (that one
> which was allocated by the debug_vm_pgtable test), which also explains the

How did you confirm that it was allocated by debug_vm_pgtable? By adding
trace prints during its execution and then matching up the addresses?
Just curious.

> negative pgtables_bytes since it's accounting for not allocated entries in the
> current process. As far as I looked pgd_{alloc,free} etc. does not clear entries,

So should they clear entries, or would doing so add to overall latency?

> and clearing of the entries is explicitly done in the free_pgtables->
> free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range->
> free_pte_range path. However, the debug_vm_pgtable test does not call
> free_pgtables, since it allocates mm_struct and entries manually for its test
> and eg. not goes through page faults. So it also should clear manually the
> entries before exit at destroy_args().

Makes sense.

> 
> This problem was noticed on a reboot X number of times test being done
> on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE
> enabled. Depends on the system, but on a 100 times reboot loop the
> problem could manifest once or twice, if a process ends up getting the
> right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable.
> After using this patch, I couldn't reproduce/experience the problems
> anymore. I was able to reproduce the problem as well on latest upstream
> kernel (6.16).

This seems like a very rare case, both to reproduce and to confirm that this
patch has indeed solved the problem. Just wondering: did you try to reproduce
this problem on any platform other than powerpc?

> 
> I also modified destroy_args() to use mmput() instead of mmdrop(), there
> is no reason to hold mm_users reference and not release the mm_struct
> entirely, and in the output above with my debugging prints I already
> had patched it to use mmput, it did not fix the problem, but helped
> in the debugging as well.

Makes sense.

> 
> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
> ---
>  mm/debug_vm_pgtable.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 7731b238b534..0f5ddefd128a 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args)
>  
>  	/* Free page table entries */
>  	if (args->start_ptep) {
> +		pmd_clear(args->pmdp);
>  		pte_free(args->mm, args->start_ptep);
>  		mm_dec_nr_ptes(args->mm);
>  	}
>  
>  	if (args->start_pmdp) {
> +		pud_clear(args->pudp);
>  		pmd_free(args->mm, args->start_pmdp);
>  		mm_dec_nr_pmds(args->mm);
>  	}
>  
>  	if (args->start_pudp) {
> +		p4d_clear(args->p4dp);
>  		pud_free(args->mm, args->start_pudp);
>  		mm_dec_nr_puds(args->mm);
>  	}
>  
> -	if (args->start_p4dp)
> +	if (args->start_p4dp) {
> +		pgd_clear(args->pgdp);
>  		p4d_free(args->mm, args->start_p4dp);
> +	}
>  
>  	/* Free vma and mm struct */
>  	if (args->vma)
>  		vm_area_free(args->vma);
>  
>  	if (args->mm)
> -		mmdrop(args->mm);
> +		mmput(args->mm);
>  }
>  
>  static struct page * __init

A quick test on the arm64 platform looked fine. It might be better to get
this enabled and tested on multiple platforms via linux-next.

Re: [PATCH] mm/debug_vm_pgtable: clear page table entries at destroy_args()

Posted by Herton Krzesinski 6 months, 1 week ago

On Thu, Jul 31, 2025 at 11:41 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> Hello Herton,
>
> On 01/08/25 3:10 AM, Herton R. Krzesinski wrote:
> > The mm/debug_vm_pagetable test allocates manually page table entries for the
> > tests it runs, using also its manually allocated mm_struct. That in itself is
> > ok, but when it exits, at destroy_args() it fails to clear those entries with
> > the *_clear functions.
> >
> > The problem is that leaves stale entries. If another process allocates
> > an mm_struct with a pgd at the same address, it may end up running into
> > the stale entry. This is happening in practice on a debug kernel with
>
> Should not the allocators ensure that the allocated memory elements are
> all cleaned up before using them ?

I did not see anything that cleans them. None of the pgd/pud etc. alloc
functions clean them, so as far as I understand that is the default
behaviour. I also used the crash utility on a live kernel to read the pgd
address from the mm_struct that was allocated by the debug_vm_pgtable test
and already freed, and saw that it was still populated after being freed.
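
To illustrate the idea (this is a toy userspace model, not kernel code, and
every name in it is hypothetical): a pool hands recycled tables back without
zeroing them, on the assumption that the previous owner returned them empty.
If the previous owner frees a table with a populated slot, the next owner
sees that slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy userspace model, not kernel code. All names here are hypothetical. */
#define SLOTS     8
#define POOL_SIZE 4

struct table {
	struct table *slot[SLOTS];
};

static struct table storage[POOL_SIZE];
static struct table *free_list[POOL_SIZE];
static int free_top = -1;	/* index of the most recently freed table */
static int next_fresh;		/* next never-used table in storage[] */

/* Recycled tables are handed back as-is; only fresh ones are zeroed. */
static struct table *table_alloc(void)
{
	struct table *t;

	if (free_top >= 0)
		return free_list[free_top--];

	assert(next_fresh < POOL_SIZE);
	t = &storage[next_fresh++];
	memset(t, 0, sizeof(*t));
	return t;
}

static void table_free(struct table *t)
{
	assert(free_top + 1 < POOL_SIZE);
	free_list[++free_top] = t;
}

/*
 * Populate one slot of a top-level table, free both tables, then allocate
 * again and count the populated slots the next owner sees.
 */
static int stale_slots_after_reuse(int clear_before_free)
{
	struct table *top = table_alloc();
	struct table *child = table_alloc();
	struct table *reused;
	int i, stale = 0;

	top->slot[3] = child;
	if (clear_before_free)
		top->slot[3] = NULL;	/* plays the role of a *_clear() call */

	table_free(child);
	table_free(top);

	reused = table_alloc();		/* LIFO pool: same memory as top */
	for (i = 0; i < SLOTS; i++)
		if (reused->slot[i])
			stale++;

	memset(reused, 0, sizeof(*reused));	/* leave the pool clean */
	table_free(reused);
	return stale;
}
```

In this model, stale_slots_after_reuse(0) reports one leftover slot and
stale_slots_after_reuse(1) reports none, which is the difference the
*_clear calls are meant to make.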

>
> > CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some
> > extra debugging I added (it prints a warning trace if pgtables_bytes goes
> > negative, in addition to the warning at check_mm() function):
> >
> > [    2.539353] debug_vm_pgtable: [get_random_vaddr         ]: random_vaddr is 0x7ea247140000
> > [    2.539366] kmem_cache info
> > [    2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
> > [    2.539447] debug_vm_pgtable: [init_args                ]: args->mm is 0x000000002267cc9e
> > (...)
> > [    2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
> > [    2.552816] Modules linked in:
> > [    2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
> > [    2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
> > [    2.552872] NIP:  c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
> > [    2.552885] REGS: c0000000622e73b0 TRAP: 0700   Not tainted  (6.12.0-105.debug_vm2.el10.ppc64le+debug)
> > [    2.552899] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822  XER: 0000000a
> > [    2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
> > [    2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
> > [    2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
> > [    2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
> > [    2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
> > [    2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
> > [    2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
> > [    2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
> > [    2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
> > [    2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
> > [    2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
> > [    2.553199] Call Trace:
> > [    2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
> > [    2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
> > [    2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
> > [    2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
> > [    2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
> > [    2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
> > [    2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
> > [    2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
> > [    2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
> > [    2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
> > [    2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> > (...)
> > [    2.558892] ---[ end trace 0000000000000000 ]---
> > [    2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
> > [    2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144
> >
> > Here the modprobe process ended up with an allocated mm_struct from the
> > mm_struct slab that was used before by the debug_vm_pgtable test. That is not a
> > problem, since the mm_struct is initialized again etc., however, if it ends up
> > using the same pgd table, it bumps into the old stale entry when clearing/freeing
> > the page table entries, so it tries to free an entry already gone (that one
> > which was allocated by the debug_vm_pgtable test), which also explains the
>
> How did you ensure that it was allocated from debug_vm_pgtable ? Trace prints during
> its execution and then matching up the addresses ? Just curious.

Usually the mm_struct address would match, but the real problem is the pgd
address: the pgd address allocated for the mm_struct matched. Yes, I used
trace prints, and the problem happened with the same mm_struct->pgd.
Disabling CONFIG_DEBUG_VM_PGTABLE also made the problem go away. It was
"easy" to reproduce on a powerpc machine (with a reboot loop, in my case
sometimes taking 50 or 100 iterations, since the test executes only during
early boot). If another process later got the same mm->pgd by accident, it
would hit the problem (from experience looking into the issue, it would
happen on boot with udev firing lots of modprobe instances, and one
eventually got the mm_struct from the slab and the same pgd that was used
before). What led me to investigate this was that I saw some reports of
"BUG: non-zero pgtables_bytes on freeing mm" messages, sometimes followed
by corruption/panics usually related to page table entries, in that reboot
loop test. I was then able to determine that CONFIG_DEBUG_VM_PGTABLE was to
blame, and from there found that even with the tests manually disabled,
just allocating the pgtable entries was enough to trigger the issue.
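
As a rough illustration of why the counter can go negative (a toy userspace
model with hypothetical names; TABLE_BYTES is an arbitrary size and is not
meant to match the value in the log): the counter is incremented only for
tables the current mm allocated, but teardown decrements it for every
populated entry it finds, stale ones included.

```c
#include <assert.h>

/* Toy userspace model, not kernel code. All names here are hypothetical. */
#define TABLE_BYTES 4096L	/* arbitrary per-table size for the model */

struct toy_mm {
	long pgtables_bytes;
};

/*
 * Account for the tables this mm really allocated, then run a teardown
 * that also "frees" stale tables inherited from a previous owner.
 * Returns the final counter value; anything other than 0 is a leak or,
 * when negative, a free of something this mm never accounted for.
 */
static long teardown_balance(int accounted_tables, int stale_tables)
{
	struct toy_mm mm = { 0 };
	int i;

	assert(accounted_tables >= 0 && stale_tables >= 0);

	for (i = 0; i < accounted_tables; i++)
		mm.pgtables_bytes += TABLE_BYTES;

	/* Teardown cannot tell stale entries from the mm's own entries. */
	for (i = 0; i < accounted_tables + stale_tables; i++)
		mm.pgtables_bytes -= TABLE_BYTES;

	return mm.pgtables_bytes;
}
```

With no stale tables the balance is zero; each stale table pushes the
balance down by one table size.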

>
> > negative pgtables_bytes since it's accounting for not allocated entries in the
> > current process. As far as I looked pgd_{alloc,free} etc. does not clear entries,
> So should they clear entries or doing so would add to overall latency ?
>
> > and clearing of the entries is explicitly done in the free_pgtables->
> > free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range->
> > free_pte_range path. However, the debug_vm_pgtable test does not call
> > free_pgtables, since it allocates mm_struct and entries manually for its test
> > and eg. not goes through page faults. So it also should clear manually the
> > entries before exit at destroy_args().
>
> Makes sense.
>
> >
> > This problem was noticed on a reboot X number of times test being done
> > on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE
> > enabled. Depends on the system, but on a 100 times reboot loop the
> > problem could manifest once or twice, if a process ends up getting the
> > right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable.
> > After using this patch, I couldn't reproduce/experience the problems
> > anymore. I was able to reproduce the problem as well on latest upstream
> > kernel (6.16).
>
> Seems like a very rare case i.e both to reproduce and also to confirm if this patch
> here has indeed solved the problem. Just wondering - did you try to reproduce this
> problem on any other platform than powerpc ?

I only tried to reproduce it on ppc, and did reproduce it there, since all
the reports I saw were on ppc; I didn't see reports on other architectures.
I tested the patch on ppc/x86/s390/arm64 with a larger reboot loop test
(200 reboots). From my understanding, another process getting the same
mm->pgd is the key, so if a process gets lucky enough, it triggers the
issue.

>
> >
> > I also modified destroy_args() to use mmput() instead of mmdrop(), there
> > is no reason to hold mm_users reference and not release the mm_struct
> > entirely, and in the output above with my debugging prints I already
> > had patched it to use mmput, it did not fix the problem, but helped
> > in the debugging as well.
>
> Makes sense.
>
> >
> > Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
> > ---
> >  mm/debug_vm_pgtable.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> > index 7731b238b534..0f5ddefd128a 100644
> > --- a/mm/debug_vm_pgtable.c
> > +++ b/mm/debug_vm_pgtable.c
> > @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args)
> >
> >       /* Free page table entries */
> >       if (args->start_ptep) {
> > +             pmd_clear(args->pmdp);
> >               pte_free(args->mm, args->start_ptep);
> >               mm_dec_nr_ptes(args->mm);
> >       }
> >
> >       if (args->start_pmdp) {
> > +             pud_clear(args->pudp);
> >               pmd_free(args->mm, args->start_pmdp);
> >               mm_dec_nr_pmds(args->mm);
> >       }
> >
> >       if (args->start_pudp) {
> > +             p4d_clear(args->p4dp);
> >               pud_free(args->mm, args->start_pudp);
> >               mm_dec_nr_puds(args->mm);
> >       }
> >
> > -     if (args->start_p4dp)
> > +     if (args->start_p4dp) {
> > +             pgd_clear(args->pgdp);
> >               p4d_free(args->mm, args->start_p4dp);
> > +     }
> >
> >       /* Free vma and mm struct */
> >       if (args->vma)
> >               vm_area_free(args->vma);
> >
> >       if (args->mm)
> > -             mmdrop(args->mm);
> > +             mmput(args->mm);
> >  }
> >
> >  static struct page * __init
> A quick test on arm64 platform looked fine. It might be better to get this
> enabled and tested on multiple platforms via linux-next.
>

Re: [PATCH] mm/debug_vm_pgtable: clear page table entries at destroy_args()

Posted by Andrew Morton 6 months, 1 week ago

On Thu, 31 Jul 2025 18:40:51 -0300 "Herton R. Krzesinski" <herton@redhat.com> wrote:

> The mm/debug_vm_pagetable test allocates manually page table entries for the
> tests it runs, using also its manually allocated mm_struct. That in itself is
> ok, but when it exits, at destroy_args() it fails to clear those entries with
> the *_clear functions.
> 
> The problem is that leaves stale entries. If another process allocates
> an mm_struct with a pgd at the same address, it may end up running into
> the stale entry. This is happening in practice on a debug kernel with
> CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some
> extra debugging I added (it prints a warning trace if pgtables_bytes goes
> negative, in addition to the warning at check_mm() function):

A quick shot with git-blame led me to include

Fixes: 3c9b84f044a9e ("mm/debug_vm_pgtable: introduce struct pgtable_debug_args")
Cc: <stable@vger.kernel.org>

And `git show 3c9b84f044a9e' tells me this email didn't have enough cc's
(added).

Thanks, I'll include this in mm.git's mm-hotfixes branch and I shall
await further review activity.


> [    2.539353] debug_vm_pgtable: [get_random_vaddr         ]: random_vaddr is 0x7ea247140000
> [    2.539366] kmem_cache info
> [    2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
> [    2.539447] debug_vm_pgtable: [init_args                ]: args->mm is 0x000000002267cc9e
> (...)
> [    2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
> [    2.552816] Modules linked in:
> [    2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
> [    2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
> [    2.552872] NIP:  c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
> [    2.552885] REGS: c0000000622e73b0 TRAP: 0700   Not tainted  (6.12.0-105.debug_vm2.el10.ppc64le+debug)
> [    2.552899] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822  XER: 0000000a
> [    2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
> [    2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
> [    2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
> [    2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
> [    2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
> [    2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
> [    2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
> [    2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
> [    2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
> [    2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
> [    2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
> [    2.553199] Call Trace:
> [    2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
> [    2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
> [    2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
> [    2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
> [    2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
> [    2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
> [    2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
> [    2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
> [    2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
> [    2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
> [    2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> (...)
> [    2.558892] ---[ end trace 0000000000000000 ]---
> [    2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
> [    2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144
> 
> Here the modprobe process ended up with an allocated mm_struct from the
> mm_struct slab that was used before by the debug_vm_pgtable test. That is not a
> problem, since the mm_struct is initialized again etc., however, if it ends up
> using the same pgd table, it bumps into the old stale entry when clearing/freeing
> the page table entries, so it tries to free an entry already gone (that one
> which was allocated by the debug_vm_pgtable test), which also explains the
> negative pgtables_bytes since it's accounting for not allocated entries in the
> current process. As far as I looked pgd_{alloc,free} etc. does not clear entries,
> and clearing of the entries is explicitly done in the free_pgtables->
> free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range->
> free_pte_range path. However, the debug_vm_pgtable test does not call
> free_pgtables, since it allocates mm_struct and entries manually for its test
> and eg. not goes through page faults. So it also should clear manually the
> entries before exit at destroy_args().
> 
> This problem was noticed on a reboot X number of times test being done
> on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE
> enabled. Depends on the system, but on a 100 times reboot loop the
> problem could manifest once or twice, if a process ends up getting the
> right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable.
> After using this patch, I couldn't reproduce/experience the problems
> anymore. I was able to reproduce the problem as well on latest upstream
> kernel (6.16).
> 
> I also modified destroy_args() to use mmput() instead of mmdrop(), there
> is no reason to hold mm_users reference and not release the mm_struct
> entirely, and in the output above with my debugging prints I already
> had patched it to use mmput, it did not fix the problem, but helped
> in the debugging as well.
> 
> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
> ---
>  mm/debug_vm_pgtable.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 7731b238b534..0f5ddefd128a 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args)
>  
>  	/* Free page table entries */
>  	if (args->start_ptep) {
> +		pmd_clear(args->pmdp);
>  		pte_free(args->mm, args->start_ptep);
>  		mm_dec_nr_ptes(args->mm);
>  	}
>  
>  	if (args->start_pmdp) {
> +		pud_clear(args->pudp);
>  		pmd_free(args->mm, args->start_pmdp);
>  		mm_dec_nr_pmds(args->mm);
>  	}
>  
>  	if (args->start_pudp) {
> +		p4d_clear(args->p4dp);
>  		pud_free(args->mm, args->start_pudp);
>  		mm_dec_nr_puds(args->mm);
>  	}
>  
> -	if (args->start_p4dp)
> +	if (args->start_p4dp) {
> +		pgd_clear(args->pgdp);
>  		p4d_free(args->mm, args->start_p4dp);
> +	}
>  
>  	/* Free vma and mm struct */
>  	if (args->vma)
>  		vm_area_free(args->vma);
>  
>  	if (args->mm)
> -		mmdrop(args->mm);
> +		mmput(args->mm);
>  }
>  
>  static struct page * __init
> -- 
> 2.47.1
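
The failure mode described in the quoted changelog can be reproduced in a toy
user-space model. The sketch below is not kernel code and every name in it
(toy_mm, recycled_upper, test_destroy, walk_and_free, run) is hypothetical. It
assumes only what the changelog states: the upper-level table is handed to the
next owner without being zeroed, and the regular teardown path frees and
un-accounts whatever non-empty entry it finds.

```c
#include <assert.h>
#include <string.h>

#define SLOTS 4

/* Toy lower-level table; "live" lets the model detect a second free. */
struct lower { int live; };
/* Toy upper-level table: each slot is empty (NULL) or points to a lower table. */
struct upper { struct lower *slot[SLOTS]; };

static struct lower pool[SLOTS];
/* Assumption: the allocator recycles this table without zeroing it. */
static struct upper recycled_upper;
static int bad_frees;

struct toy_mm {
	struct upper *top;
	long accounted;		/* stands in for pgtables_bytes */
};

static struct lower *lower_alloc(void)
{
	for (int i = 0; i < SLOTS; i++)
		if (!pool[i].live) {
			pool[i].live = 1;
			return &pool[i];
		}
	return NULL;
}

static void lower_free(struct lower *l)
{
	if (!l->live) {		/* already freed: count it instead of crashing */
		bad_frees++;
		return;
	}
	l->live = 0;
}

static void toy_mm_init(struct toy_mm *mm)
{
	mm->top = &recycled_upper;	/* same upper table every time */
	mm->accounted = 0;
}

/* Test-style teardown: frees the lower table by hand, optionally clearing
 * the upper-level entry that points to it. */
static void test_destroy(struct toy_mm *mm, int idx, int clear_entry)
{
	struct lower *l = mm->top->slot[idx];

	if (clear_entry)
		mm->top->slot[idx] = NULL;
	lower_free(l);
	mm->accounted--;
}

/* Regular teardown: clears, frees and un-accounts every non-empty entry. */
static void walk_and_free(struct toy_mm *mm)
{
	for (int i = 0; i < SLOTS; i++) {
		struct lower *l = mm->top->slot[i];

		if (!l)
			continue;
		mm->top->slot[i] = NULL;
		lower_free(l);
		mm->accounted--;
	}
}

/* Returns the second owner's accounting after its teardown. */
static long run(int clear_entry)
{
	struct toy_mm test_mm, proc_mm;

	memset(&recycled_upper, 0, sizeof(recycled_upper));
	memset(pool, 0, sizeof(pool));
	bad_frees = 0;

	toy_mm_init(&test_mm);
	test_mm.top->slot[1] = lower_alloc();
	assert(test_mm.top->slot[1] != NULL);
	test_mm.accounted++;
	test_destroy(&test_mm, 1, clear_entry);

	toy_mm_init(&proc_mm);		/* next owner of the same upper table */
	walk_and_free(&proc_mm);
	return proc_mm.accounted;
}
```

In this model, run(0) leaves the second owner with negative accounting and one
free of an already-freed table, which mirrors the negative pgtables_bytes in
the report. run(1), which clears the entry before freeing as the patch does,
leaves the accounting at zero.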

Re: [PATCH] mm/debug_vm_pgtable: clear page table entries at destroy_args()

Posted by Anshuman Khandual 6 months, 1 week ago

On 02/08/25 2:20 AM, Andrew Morton wrote:
> On Thu, 31 Jul 2025 18:40:51 -0300 "Herton R. Krzesinski" <herton@redhat.com> wrote:
> 
>> The mm/debug_vm_pgtable test manually allocates page table entries for the
>> tests it runs, also using its own manually allocated mm_struct. That in itself
>> is ok, but when it exits, destroy_args() fails to clear those entries with
>> the *_clear functions.
>>
>> The problem is that this leaves stale entries. If another process allocates
>> an mm_struct with a pgd at the same address, it may end up running into
>> the stale entry. This happens in practice on a debug kernel with
>> CONFIG_DEBUG_VM_PGTABLE=y. For example, this is the output with some
>> extra debugging I added (it prints a warning trace if pgtables_bytes goes
>> negative, in addition to the warning in the check_mm() function):
> 
> A quick shot with git-blame led me to include
> 
> Fixes: 3c9b84f044a9e ("mm/debug_vm_pgtable: introduce struct pgtable_debug_args")
> Cc: <stable@vger.kernel.org>

Agreed.

> 
> And `git show 3c9b84f044a9e' tells me this email didn't have enough cc's
> (added).

Sure, that makes sense.

> 
> Thanks, I'll include this in mm.git's mm-hotfixes branch and I shall
> await further review activity.

Right - it will be great to have this tested across other supporting platforms.

> 
> 
>> [    2.539353] debug_vm_pgtable: [get_random_vaddr         ]: random_vaddr is 0x7ea247140000
>> [    2.539366] kmem_cache info
>> [    2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
>> [    2.539447] debug_vm_pgtable: [init_args                ]: args->mm is 0x000000002267cc9e
>> (...)
>> [    2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
>> [    2.552816] Modules linked in:
>> [    2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
>> [    2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
>> [    2.552872] NIP:  c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
>> [    2.552885] REGS: c0000000622e73b0 TRAP: 0700   Not tainted  (6.12.0-105.debug_vm2.el10.ppc64le+debug)
>> [    2.552899] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822  XER: 0000000a
>> [    2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
>> [    2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
>> [    2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
>> [    2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
>> [    2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
>> [    2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
>> [    2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
>> [    2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
>> [    2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
>> [    2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
>> [    2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
>> [    2.553199] Call Trace:
>> [    2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
>> [    2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
>> [    2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
>> [    2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
>> [    2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
>> [    2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
>> [    2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
>> [    2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
>> [    2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
>> [    2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
>> [    2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
>> (...)
>> [    2.558892] ---[ end trace 0000000000000000 ]---
>> [    2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
>> [    2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144
>>
>> Here the modprobe process ended up with an allocated mm_struct from the
>> mm_struct slab that was used before by the debug_vm_pgtable test. That is not a
>> problem, since the mm_struct is initialized again etc., however, if it ends up
>> using the same pgd table, it bumps into the old stale entry when clearing/freeing
>> the page table entries, so it tries to free an entry already gone (that one
>> which was allocated by the debug_vm_pgtable test), which also explains the
>> negative pgtables_bytes since it's accounting for not allocated entries in the
>> current process. As far as I looked pgd_{alloc,free} etc. does not clear entries,
>> and clearing of the entries is explicitly done in the free_pgtables->
>> free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range->
>> free_pte_range path. However, the debug_vm_pgtable test does not call
>> free_pgtables, since it allocates mm_struct and entries manually for its test
>> and eg. not goes through page faults. So it also should clear manually the
>> entries before exit at destroy_args().
>>
>> This problem was noticed on a reboot X number of times test being done
>> on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE
>> enabled. Depends on the system, but on a 100 times reboot loop the
>> problem could manifest once or twice, if a process ends up getting the
>> right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable.
>> After using this patch, I couldn't reproduce/experience the problems
>> anymore. I was able to reproduce the problem as well on latest upstream
>> kernel (6.16).
>>
>> I also modified destroy_args() to use mmput() instead of mmdrop(), there
>> is no reason to hold mm_users reference and not release the mm_struct
>> entirely, and in the output above with my debugging prints I already
>> had patched it to use mmput, it did not fix the problem, but helped
>> in the debugging as well.
>>
>> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
>> ---
>>  mm/debug_vm_pgtable.c | 9 +++++++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 7731b238b534..0f5ddefd128a 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args)
>>  
>>  	/* Free page table entries */
>>  	if (args->start_ptep) {
>> +		pmd_clear(args->pmdp);
>>  		pte_free(args->mm, args->start_ptep);
>>  		mm_dec_nr_ptes(args->mm);
>>  	}
>>  
>>  	if (args->start_pmdp) {
>> +		pud_clear(args->pudp);
>>  		pmd_free(args->mm, args->start_pmdp);
>>  		mm_dec_nr_pmds(args->mm);
>>  	}
>>  
>>  	if (args->start_pudp) {
>> +		p4d_clear(args->p4dp);
>>  		pud_free(args->mm, args->start_pudp);
>>  		mm_dec_nr_puds(args->mm);
>>  	}
>>  
>> -	if (args->start_p4dp)
>> +	if (args->start_p4dp) {
>> +		pgd_clear(args->pgdp);
>>  		p4d_free(args->mm, args->start_p4dp);
>> +	}
>>  
>>  	/* Free vma and mm struct */
>>  	if (args->vma)
>>  		vm_area_free(args->vma);
>>  
>>  	if (args->mm)
>> -		mmdrop(args->mm);
>> +		mmput(args->mm);
>>  }
>>  
>>  static struct page * __init
>> -- 
>> 2.47.1

Re: [PATCH] mm/debug_vm_pgtable: clear page table entries at destroy_args()

Posted by Anshuman Khandual 5 months, 3 weeks ago

This has been on linux-next for almost the last two weeks now and
no problem has been reported. So I guess it's all good.

Re: [PATCH] mm/debug_vm_pgtable: clear page table entries at destroy_args()

Posted by Andrew Morton 5 months, 3 weeks ago

On Thu, 14 Aug 2025 16:16:03 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote:

> On 01/08/25 3:10 AM, Herton R. Krzesinski wrote:
> > The mm/debug_vm_pgtable test manually allocates page table entries for the
> > tests it runs, also using its own manually allocated mm_struct. That in itself
> > is ok, but when it exits, destroy_args() fails to clear those entries with
> > the *_clear functions.
> > 
> > The problem is that this leaves stale entries. If another process allocates
> > an mm_struct with a pgd at the same address, it may end up running into
> > the stale entry. This happens in practice on a debug kernel with
> > CONFIG_DEBUG_VM_PGTABLE=y. For example, this is the output with some
> > extra debugging I added (it prints a warning trace if pgtables_bytes goes
> > negative, in addition to the warning in the check_mm() function):
>
> This has been on linux-next for almost the last two weeks now and
> no problem has been reported. So I guess it's all good.
> 

[top-posting repaired]

Thanks, I'll move this into the next batch for sending into mainline.
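
For readers less familiar with the mmdrop() to mmput() change mentioned in the
patch description, the difference can be pictured with two counters. The
sketch below is a user-space model with hypothetical names (toy_mm,
toy_mm_alloc, toy_mmdrop, toy_mmput), not the kernel implementation. It
assumes a freshly allocated mm starts with mm_users and mm_count both at 1,
that mmdrop() only drops mm_count, and that mmput() drops mm_users and, on
reaching zero, tears down the address space (the __mmput -> exit_mmap path
visible in the trace) before dropping mm_count.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mm {
	int mm_users;			/* users of the address space */
	int mm_count;			/* references to the struct itself */
	bool address_space_torn_down;	/* stands in for exit_mmap() having run */
	bool struct_freed;
};

/* Assumption: a new mm starts with one user and one struct reference. */
static void toy_mm_alloc(struct toy_mm *mm)
{
	mm->mm_users = 1;
	mm->mm_count = 1;
	mm->address_space_torn_down = false;
	mm->struct_freed = false;
}

/* Drops a struct reference only; never touches mm_users. */
static void toy_mmdrop(struct toy_mm *mm)
{
	assert(mm->mm_count > 0);
	if (--mm->mm_count == 0)
		mm->struct_freed = true;
}

/* Drops a user reference; the last user tears down the address space
 * and then drops the struct reference held on behalf of the users. */
static void toy_mmput(struct toy_mm *mm)
{
	assert(mm->mm_users > 0);
	if (--mm->mm_users == 0) {
		mm->address_space_torn_down = true;
		toy_mmdrop(mm);
	}
}
```

In this model, calling only toy_mmdrop() on a fresh toy_mm frees the struct
while mm_users is still 1 and the address-space teardown never runs, whereas
toy_mmput() releases both references in order.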