mm/debug_vm_pgtable.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
The mm/debug_vm_pagetable test allocates manually page table entries for the
tests it runs, using also its manually allocated mm_struct. That in itself is
ok, but when it exits, at destroy_args() it fails to clear those entries with
the *_clear functions.
The problem is that leaves stale entries. If another process allocates
an mm_struct with a pgd at the same address, it may end up running into
the stale entry. This is happening in practice on a debug kernel with
CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some
extra debugging I added (it prints a warning trace if pgtables_bytes goes
negative, in addition to the warning at check_mm() function):
[ 2.539353] debug_vm_pgtable: [get_random_vaddr ]: random_vaddr is 0x7ea247140000
[ 2.539366] kmem_cache info
[ 2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
[ 2.539447] debug_vm_pgtable: [init_args ]: args->mm is 0x000000002267cc9e
(...)
[ 2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
[ 2.552816] Modules linked in:
[ 2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
[ 2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
[ 2.552872] NIP: c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
[ 2.552885] REGS: c0000000622e73b0 TRAP: 0700 Not tainted (6.12.0-105.debug_vm2.el10.ppc64le+debug)
[ 2.552899] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 0000000a
[ 2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
[ 2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
[ 2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
[ 2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
[ 2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
[ 2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
[ 2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
[ 2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
[ 2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
[ 2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
[ 2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
[ 2.553199] Call Trace:
[ 2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
[ 2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
[ 2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
[ 2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
[ 2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
[ 2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
[ 2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
[ 2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
[ 2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
[ 2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
[ 2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
(...)
[ 2.558892] ---[ end trace 0000000000000000 ]---
[ 2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
[ 2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144
Here the modprobe process ended up with an allocated mm_struct from the
mm_struct slab that was used before by the debug_vm_pgtable test. That is not a
problem, since the mm_struct is initialized again etc., however, if it ends up
using the same pgd table, it bumps into the old stale entry when clearing/freeing
the page table entries, so it tries to free an entry already gone (that one
which was allocated by the debug_vm_pgtable test), which also explains the
negative pgtables_bytes since it's accounting for not allocated entries in the
current process. As far as I looked pgd_{alloc,free} etc. does not clear entries,
and clearing of the entries is explicitly done in the free_pgtables->
free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range->
free_pte_range path. However, the debug_vm_pgtable test does not call
free_pgtables, since it allocates mm_struct and entries manually for its test
and eg. not goes through page faults. So it also should clear manually the
entries before exit at destroy_args().
This problem was noticed on a reboot X number of times test being done
on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE
enabled. Depends on the system, but on a 100 times reboot loop the
problem could manifest once or twice, if a process ends up getting the
right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable.
After using this patch, I couldn't reproduce/experience the problems
anymore. I was able to reproduce the problem as well on latest upstream
kernel (6.16).
I also modified destroy_args() to use mmput() instead of mmdrop(), there
is no reason to hold mm_users reference and not release the mm_struct
entirely, and in the output above with my debugging prints I already
had patched it to use mmput, it did not fix the problem, but helped
in the debugging as well.
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
---
mm/debug_vm_pgtable.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 7731b238b534..0f5ddefd128a 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args)
/* Free page table entries */
if (args->start_ptep) {
+ pmd_clear(args->pmdp);
pte_free(args->mm, args->start_ptep);
mm_dec_nr_ptes(args->mm);
}
if (args->start_pmdp) {
+ pud_clear(args->pudp);
pmd_free(args->mm, args->start_pmdp);
mm_dec_nr_pmds(args->mm);
}
if (args->start_pudp) {
+ p4d_clear(args->p4dp);
pud_free(args->mm, args->start_pudp);
mm_dec_nr_puds(args->mm);
}
- if (args->start_p4dp)
+ if (args->start_p4dp) {
+ pgd_clear(args->pgdp);
p4d_free(args->mm, args->start_p4dp);
+ }
/* Free vma and mm struct */
if (args->vma)
vm_area_free(args->vma);
if (args->mm)
- mmdrop(args->mm);
+ mmput(args->mm);
}
static struct page * __init
--
2.47.1
Hello Herton, On 01/08/25 3:10 AM, Herton R. Krzesinski wrote: > The mm/debug_vm_pagetable test allocates manually page table entries for the > tests it runs, using also its manually allocated mm_struct. That in itself is > ok, but when it exits, at destroy_args() it fails to clear those entries with > the *_clear functions. > > The problem is that leaves stale entries. If another process allocates > an mm_struct with a pgd at the same address, it may end up running into > the stale entry. This is happening in practice on a debug kernel with Should not the allocators ensure that the allocated memory elements are all cleaned up before using them ? > CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some > extra debugging I added (it prints a warning trace if pgtables_bytes goes > negative, in addition to the warning at check_mm() function): > > [ 2.539353] debug_vm_pgtable: [get_random_vaddr ]: random_vaddr is 0x7ea247140000 > [ 2.539366] kmem_cache info > [ 2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508 > [ 2.539447] debug_vm_pgtable: [init_args ]: args->mm is 0x000000002267cc9e > (...) > [ 2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0 > [ 2.552816] Modules linked in: > [ 2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY > [ 2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries > [ 2.552872] NIP: c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90 > [ 2.552885] REGS: c0000000622e73b0 TRAP: 0700 Not tainted (6.12.0-105.debug_vm2.el10.ppc64le+debug) > [ 2.552899] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 0000000a > [ 2.552954] CFAR: c0000000008f03f0 IRQMASK: 0 > [ 2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001 > [ 2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff > [ 2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000 > [ 2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb > [ 2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0 > [ 2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000 > [ 2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001 > [ 2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760 > [ 2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0 > [ 2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0 > [ 2.553199] Call Trace: > [ 2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable) > [ 2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0 > [ 2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570 > [ 2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650 > [ 2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290 > [ 2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0 > [ 2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870 > [ 2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150 > [ 2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50 > [ 2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0 > [ 2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec > (...) > [ 2.558892] ---[ end trace 0000000000000000 ]--- > [ 2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1 > [ 2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144 > > Here the modprobe process ended up with an allocated mm_struct from the > mm_struct slab that was used before by the debug_vm_pgtable test. That is not a > problem, since the mm_struct is initialized again etc., however, if it ends up > using the same pgd table, it bumps into the old stale entry when clearing/freeing > the page table entries, so it tries to free an entry already gone (that one > which was allocated by the debug_vm_pgtable test), which also explains the How did you ensure that it was allocated from debug_vm_pgtable ? Trace prints during its execution and then matching up the addresses ? Just curious. > negative pgtables_bytes since it's accounting for not allocated entries in the > current process. As far as I looked pgd_{alloc,free} etc. does not clear entries, So should they clear entries or doing so would add to overall latency ? > and clearing of the entries is explicitly done in the free_pgtables-> > free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range-> > free_pte_range path. However, the debug_vm_pgtable test does not call > free_pgtables, since it allocates mm_struct and entries manually for its test > and eg. not goes through page faults. So it also should clear manually the > entries before exit at destroy_args(). Makes sense. > > This problem was noticed on a reboot X number of times test being done > on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE > enabled. Depends on the system, but on a 100 times reboot loop the > problem could manifest once or twice, if a process ends up getting the > right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable. > After using this patch, I couldn't reproduce/experience the problems > anymore. I was able to reproduce the problem as well on latest upstream > kernel (6.16). Seems like a very rare case i.e both to reproduce and also to confirm if this patch here has indeed solved the problem. Just wondering - did you try to reproduce this problem on any other platform than powerpc ? > > I also modified destroy_args() to use mmput() instead of mmdrop(), there > is no reason to hold mm_users reference and not release the mm_struct > entirely, and in the output above with my debugging prints I already > had patched it to use mmput, it did not fix the problem, but helped > in the debugging as well. Makes sense. > > Signed-off-by: Herton R. Krzesinski <herton@redhat.com> > --- > mm/debug_vm_pgtable.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 7731b238b534..0f5ddefd128a 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args) > > /* Free page table entries */ > if (args->start_ptep) { > + pmd_clear(args->pmdp); > pte_free(args->mm, args->start_ptep); > mm_dec_nr_ptes(args->mm); > } > > if (args->start_pmdp) { > + pud_clear(args->pudp); > pmd_free(args->mm, args->start_pmdp); > mm_dec_nr_pmds(args->mm); > } > > if (args->start_pudp) { > + p4d_clear(args->p4dp); > pud_free(args->mm, args->start_pudp); > mm_dec_nr_puds(args->mm); > } > > - if (args->start_p4dp) > + if (args->start_p4dp) { > + pgd_clear(args->pgdp); > p4d_free(args->mm, args->start_p4dp); > + } > > /* Free vma and mm struct */ > if (args->vma) > vm_area_free(args->vma); > > if (args->mm) > - mmdrop(args->mm); > + mmput(args->mm); > } > > static struct page * __init A quick test on arm64 platform looked fine. It might be better to get this enabled and tested on multiple platforms via linux-next.
On Thu, Jul 31, 2025 at 11:41 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > Hello Herton, > > On 01/08/25 3:10 AM, Herton R. Krzesinski wrote: > > The mm/debug_vm_pagetable test allocates manually page table entries for the > > tests it runs, using also its manually allocated mm_struct. That in itself is > > ok, but when it exits, at destroy_args() it fails to clear those entries with > > the *_clear functions. > > > > The problem is that leaves stale entries. If another process allocates > > an mm_struct with a pgd at the same address, it may end up running into > > the stale entry. This is happening in practice on a debug kernel with > > Should not the allocators ensure that the allocated memory elements are > all cleaned up before using them ? I did not saw anything which cleaned them. all the pgd/pud etc. alloc functions do not clean them, so I think that's the default behaviour from what I understand. I also used crash utility on a live kernel reading the pgd address from the mm_struct that was allocated from the debug_vm_pgtable test and already freed and saw that it was populated even after it was freed. > > > CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some > > extra debugging I added (it prints a warning trace if pgtables_bytes goes > > negative, in addition to the warning at check_mm() function): > > > > [ 2.539353] debug_vm_pgtable: [get_random_vaddr ]: random_vaddr is 0x7ea247140000 > > [ 2.539366] kmem_cache info > > [ 2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508 > > [ 2.539447] debug_vm_pgtable: [init_args ]: args->mm is 0x000000002267cc9e > > (...) > > [ 2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0 > > [ 2.552816] Modules linked in: > > [ 2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY > > [ 2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries > > [ 2.552872] NIP: c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90 > > [ 2.552885] REGS: c0000000622e73b0 TRAP: 0700 Not tainted (6.12.0-105.debug_vm2.el10.ppc64le+debug) > > [ 2.552899] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 0000000a > > [ 2.552954] CFAR: c0000000008f03f0 IRQMASK: 0 > > [ 2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001 > > [ 2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff > > [ 2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000 > > [ 2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb > > [ 2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0 > > [ 2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000 > > [ 2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001 > > [ 2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760 > > [ 2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0 > > [ 2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0 > > [ 2.553199] Call Trace: > > [ 2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable) > > [ 2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0 > > [ 2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570 > > [ 2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650 > > [ 2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290 > > [ 2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0 > > [ 2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870 > > [ 2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150 > > [ 2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50 > > [ 2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0 > > [ 2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec > > (...) > > [ 2.558892] ---[ end trace 0000000000000000 ]--- > > [ 2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1 > > [ 2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144 > > > > Here the modprobe process ended up with an allocated mm_struct from the > > mm_struct slab that was used before by the debug_vm_pgtable test. That is not a > > problem, since the mm_struct is initialized again etc., however, if it ends up > > using the same pgd table, it bumps into the old stale entry when clearing/freeing > > the page table entries, so it tries to free an entry already gone (that one > > which was allocated by the debug_vm_pgtable test), which also explains the > > How did you ensure that it was allocated from debug_vm_pgtable ? Trace prints during > its execution and then matching up the addresses ? Just curious. Usually the mm_struct address would match, but the problem is the pgd address, the pgd address allocated for the mm_struct matched. Yes trace prints and the problem happening with same mm_struct->pgd. Also disabling CONFIG_DEBUG_VM_PGTABLE also made the problem go away. It was "easy" to reproduce on a powerpc machine (with a reboot loop, in my case sometimes on a 50 or 100 times, since the test executes only on early boot), if another process after it got the same mm->pgd by accident it would get the problem (from experience looking into the issue, it would happen on boot with udev firing lots of modprobe and one eventually got the mm_struct from the slab and same pgd that was used before). What lead me investigating into this was that I saw some reports of "BUG: non-zero pgtables_bytes on freeing mm" messages reports, sometimes then followed by corruption/panic usually related to page table entries, on that reboot loop test. Then I was able to determine that CONFIG_DEBUG_VM_PGTABLE was to blame, and from there find out that even disabling the tests manually, only allocing the pgtable entries was enough to trigger the issue. > > > negative pgtables_bytes since it's accounting for not allocated entries in the > > current process. As far as I looked pgd_{alloc,free} etc. does not clear entries, > So should they clear entries or doing so would add to overall latency ? > > > and clearing of the entries is explicitly done in the free_pgtables-> > > free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range-> > > free_pte_range path. However, the debug_vm_pgtable test does not call > > free_pgtables, since it allocates mm_struct and entries manually for its test > > and eg. not goes through page faults. So it also should clear manually the > > entries before exit at destroy_args(). > > Makes sense. > > > > > This problem was noticed on a reboot X number of times test being done > > on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE > > enabled. Depends on the system, but on a 100 times reboot loop the > > problem could manifest once or twice, if a process ends up getting the > > right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable. > > After using this patch, I couldn't reproduce/experience the problems > > anymore. I was able to reproduce the problem as well on latest upstream > > kernel (6.16). > > Seems like a very rare case i.e both to reproduce and also to confirm if this patch > here has indeed solved the problem. Just wondering - did you try to reproduce this > problem on any other platform than powerpc ? I only tried and then reproduced on ppc, since all reports I saw was reproducing on it, didn't saw reports on other architectures. I tested the patch on ppc/x86/s390/arm64 with a bigger X times reboot loop test (for the test, a 200 times reboot loop). From my understanding, another process getting the same mm->pgd was the key, so if a process got lucky enough it triggers the issue. > > > > > I also modified destroy_args() to use mmput() instead of mmdrop(), there > > is no reason to hold mm_users reference and not release the mm_struct > > entirely, and in the output above with my debugging prints I already > > had patched it to use mmput, it did not fix the problem, but helped > > in the debugging as well. > > Makes sense. > > > > > Signed-off-by: Herton R. Krzesinski <herton@redhat.com> > > --- > > mm/debug_vm_pgtable.c | 9 +++++++-- > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > > index 7731b238b534..0f5ddefd128a 100644 > > --- a/mm/debug_vm_pgtable.c > > +++ b/mm/debug_vm_pgtable.c > > @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args) > > > > /* Free page table entries */ > > if (args->start_ptep) { > > + pmd_clear(args->pmdp); > > pte_free(args->mm, args->start_ptep); > > mm_dec_nr_ptes(args->mm); > > } > > > > if (args->start_pmdp) { > > + pud_clear(args->pudp); > > pmd_free(args->mm, args->start_pmdp); > > mm_dec_nr_pmds(args->mm); > > } > > > > if (args->start_pudp) { > > + p4d_clear(args->p4dp); > > pud_free(args->mm, args->start_pudp); > > mm_dec_nr_puds(args->mm); > > } > > > > - if (args->start_p4dp) > > + if (args->start_p4dp) { > > + pgd_clear(args->pgdp); > > p4d_free(args->mm, args->start_p4dp); > > + } > > > > /* Free vma and mm struct */ > > if (args->vma) > > vm_area_free(args->vma); > > > > if (args->mm) > > - mmdrop(args->mm); > > + mmput(args->mm); > > } > > > > static struct page * __init > A quick test on arm64 platform looked fine. It might be better to get this > enabled and tested on multiple platforms via linux-next. >
On Thu, 31 Jul 2025 18:40:51 -0300 "Herton R. Krzesinski" <herton@redhat.com> wrote: > The mm/debug_vm_pagetable test allocates manually page table entries for the > tests it runs, using also its manually allocated mm_struct. That in itself is > ok, but when it exits, at destroy_args() it fails to clear those entries with > the *_clear functions. > > The problem is that leaves stale entries. If another process allocates > an mm_struct with a pgd at the same address, it may end up running into > the stale entry. This is happening in practice on a debug kernel with > CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some > extra debugging I added (it prints a warning trace if pgtables_bytes goes > negative, in addition to the warning at check_mm() function): A quick shot with git-blame led me to include Fixes: 3c9b84f044a9e ("mm/debug_vm_pgtable: introduce struct pgtable_debug_args") Cc: <stable@vger.kernel.org> And `git show 3c9b84f044a9e' tell me this email didn't have enough cc's (added). Thanks, I'll include this in mm.git's mm-hotfixes branch and I shall await further review activity. > [ 2.539353] debug_vm_pgtable: [get_random_vaddr ]: random_vaddr is 0x7ea247140000 > [ 2.539366] kmem_cache info > [ 2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508 > [ 2.539447] debug_vm_pgtable: [init_args ]: args->mm is 0x000000002267cc9e > (...) > [ 2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0 > [ 2.552816] Modules linked in: > [ 2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY > [ 2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries > [ 2.552872] NIP: c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90 > [ 2.552885] REGS: c0000000622e73b0 TRAP: 0700 Not tainted (6.12.0-105.debug_vm2.el10.ppc64le+debug) > [ 2.552899] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 0000000a > [ 2.552954] CFAR: c0000000008f03f0 IRQMASK: 0 > [ 2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001 > [ 2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff > [ 2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000 > [ 2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb > [ 2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0 > [ 2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000 > [ 2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001 > [ 2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760 > [ 2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0 > [ 2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0 > [ 2.553199] Call Trace: > [ 2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable) > [ 2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0 > [ 2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570 > [ 2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650 > [ 2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290 > [ 2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0 > [ 2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870 > [ 2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150 > [ 2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50 > [ 2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0 > [ 2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec > (...) > [ 2.558892] ---[ end trace 0000000000000000 ]--- > [ 2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1 > [ 2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144 > > Here the modprobe process ended up with an allocated mm_struct from the > mm_struct slab that was used before by the debug_vm_pgtable test. That is not a > problem, since the mm_struct is initialized again etc., however, if it ends up > using the same pgd table, it bumps into the old stale entry when clearing/freeing > the page table entries, so it tries to free an entry already gone (that one > which was allocated by the debug_vm_pgtable test), which also explains the > negative pgtables_bytes since it's accounting for not allocated entries in the > current process. As far as I looked pgd_{alloc,free} etc. does not clear entries, > and clearing of the entries is explicitly done in the free_pgtables-> > free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range-> > free_pte_range path. However, the debug_vm_pgtable test does not call > free_pgtables, since it allocates mm_struct and entries manually for its test > and eg. not goes through page faults. So it also should clear manually the > entries before exit at destroy_args(). > > This problem was noticed on a reboot X number of times test being done > on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE > enabled. Depends on the system, but on a 100 times reboot loop the > problem could manifest once or twice, if a process ends up getting the > right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable. > After using this patch, I couldn't reproduce/experience the problems > anymore. I was able to reproduce the problem as well on latest upstream > kernel (6.16). > > I also modified destroy_args() to use mmput() instead of mmdrop(), there > is no reason to hold mm_users reference and not release the mm_struct > entirely, and in the output above with my debugging prints I already > had patched it to use mmput, it did not fix the problem, but helped > in the debugging as well. > > Signed-off-by: Herton R. Krzesinski <herton@redhat.com> > --- > mm/debug_vm_pgtable.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 7731b238b534..0f5ddefd128a 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args) > > /* Free page table entries */ > if (args->start_ptep) { > + pmd_clear(args->pmdp); > pte_free(args->mm, args->start_ptep); > mm_dec_nr_ptes(args->mm); > } > > if (args->start_pmdp) { > + pud_clear(args->pudp); > pmd_free(args->mm, args->start_pmdp); > mm_dec_nr_pmds(args->mm); > } > > if (args->start_pudp) { > + p4d_clear(args->p4dp); > pud_free(args->mm, args->start_pudp); > mm_dec_nr_puds(args->mm); > } > > - if (args->start_p4dp) > + if (args->start_p4dp) { > + pgd_clear(args->pgdp); > p4d_free(args->mm, args->start_p4dp); > + } > > /* Free vma and mm struct */ > if (args->vma) > vm_area_free(args->vma); > > if (args->mm) > - mmdrop(args->mm); > + mmput(args->mm); > } > > static struct page * __init > -- > 2.47.1
On 02/08/25 2:20 AM, Andrew Morton wrote: > On Thu, 31 Jul 2025 18:40:51 -0300 "Herton R. Krzesinski" <herton@redhat.com> wrote: > >> The mm/debug_vm_pagetable test allocates manually page table entries for the >> tests it runs, using also its manually allocated mm_struct. That in itself is >> ok, but when it exits, at destroy_args() it fails to clear those entries with >> the *_clear functions. >> >> The problem is that leaves stale entries. If another process allocates >> an mm_struct with a pgd at the same address, it may end up running into >> the stale entry. This is happening in practice on a debug kernel with >> CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some >> extra debugging I added (it prints a warning trace if pgtables_bytes goes >> negative, in addition to the warning at check_mm() function): > > A quick shot with git-blame led me to include > > Fixes: 3c9b84f044a9e ("mm/debug_vm_pgtable: introduce struct pgtable_debug_args") > Cc: <stable@vger.kernel.org> Agreed. > > And `git show 3c9b84f044a9e' tell me this email didn't have enough cc's > (added). Sure, that makes sense. > > Thanks, I'll include this in mm.git's mm-hotfixes branch and I shall > await further review activity. Right - it will be great to have this tested across other supporting platforms. > > >> [ 2.539353] debug_vm_pgtable: [get_random_vaddr ]: random_vaddr is 0x7ea247140000 >> [ 2.539366] kmem_cache info >> [ 2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508 >> [ 2.539447] debug_vm_pgtable: [init_args ]: args->mm is 0x000000002267cc9e >> (...) >> [ 2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0 >> [ 2.552816] Modules linked in: >> [ 2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY >> [ 2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries >> [ 2.552872] NIP: c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90 >> [ 2.552885] REGS: c0000000622e73b0 TRAP: 0700 Not tainted (6.12.0-105.debug_vm2.el10.ppc64le+debug) >> [ 2.552899] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 0000000a >> [ 2.552954] CFAR: c0000000008f03f0 IRQMASK: 0 >> [ 2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001 >> [ 2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff >> [ 2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000 >> [ 2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb >> [ 2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0 >> [ 2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000 >> [ 2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001 >> [ 2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760 >> [ 2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0 >> [ 2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0 >> [ 2.553199] Call Trace: >> [ 2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable) >> [ 2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0 >> [ 2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570 >> [ 2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650 >> [ 2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290 >> [ 2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0 >> [ 2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870 >> [ 2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150 >> [ 2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50 >> [ 2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0 >> [ 2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec >> (...) >> [ 2.558892] ---[ end trace 0000000000000000 ]--- >> [ 2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1 >> [ 2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144 >> >> Here the modprobe process ended up with an allocated mm_struct from the >> mm_struct slab that was used before by the debug_vm_pgtable test. That is not a >> problem, since the mm_struct is initialized again etc., however, if it ends up >> using the same pgd table, it bumps into the old stale entry when clearing/freeing >> the page table entries, so it tries to free an entry already gone (that one >> which was allocated by the debug_vm_pgtable test), which also explains the >> negative pgtables_bytes since it's accounting for not allocated entries in the >> current process. As far as I looked pgd_{alloc,free} etc. does not clear entries, >> and clearing of the entries is explicitly done in the free_pgtables-> >> free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range-> >> free_pte_range path. However, the debug_vm_pgtable test does not call >> free_pgtables, since it allocates mm_struct and entries manually for its test >> and eg. not goes through page faults. So it also should clear manually the >> entries before exit at destroy_args(). >> >> This problem was noticed on a reboot X number of times test being done >> on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE >> enabled. Depends on the system, but on a 100 times reboot loop the >> problem could manifest once or twice, if a process ends up getting the >> right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable. >> After using this patch, I couldn't reproduce/experience the problems >> anymore. I was able to reproduce the problem as well on latest upstream >> kernel (6.16). >> >> I also modified destroy_args() to use mmput() instead of mmdrop(), there >> is no reason to hold mm_users reference and not release the mm_struct >> entirely, and in the output above with my debugging prints I already >> had patched it to use mmput, it did not fix the problem, but helped >> in the debugging as well. >> >> Signed-off-by: Herton R. Krzesinski <herton@redhat.com> >> --- >> mm/debug_vm_pgtable.c | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c >> index 7731b238b534..0f5ddefd128a 100644 >> --- a/mm/debug_vm_pgtable.c >> +++ b/mm/debug_vm_pgtable.c >> @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args) >> >> /* Free page table entries */ >> if (args->start_ptep) { >> + pmd_clear(args->pmdp); >> pte_free(args->mm, args->start_ptep); >> mm_dec_nr_ptes(args->mm); >> } >> >> if (args->start_pmdp) { >> + pud_clear(args->pudp); >> pmd_free(args->mm, args->start_pmdp); >> mm_dec_nr_pmds(args->mm); >> } >> >> if (args->start_pudp) { >> + p4d_clear(args->p4dp); >> pud_free(args->mm, args->start_pudp); >> mm_dec_nr_puds(args->mm); >> } >> >> - if (args->start_p4dp) >> + if (args->start_p4dp) { >> + pgd_clear(args->pgdp); >> p4d_free(args->mm, args->start_p4dp); >> + } >> >> /* Free vma and mm struct */ >> if (args->vma) >> vm_area_free(args->vma); >> >> if (args->mm) >> - mmdrop(args->mm); >> + mmput(args->mm); >> } >> >> static struct page * __init >> -- >> 2.47.1
This has been on linux-next for almost last two weeks now and no problem has been reported. So I guess it's all good. On 01/08/25 3:10 AM, Herton R. Krzesinski wrote: > The mm/debug_vm_pagetable test allocates manually page table entries for the > tests it runs, using also its manually allocated mm_struct. That in itself is > ok, but when it exits, at destroy_args() it fails to clear those entries with > the *_clear functions. > > The problem is that leaves stale entries. If another process allocates > an mm_struct with a pgd at the same address, it may end up running into > the stale entry. This is happening in practice on a debug kernel with > CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some > extra debugging I added (it prints a warning trace if pgtables_bytes goes > negative, in addition to the warning at check_mm() function): > > [ 2.539353] debug_vm_pgtable: [get_random_vaddr ]: random_vaddr is 0x7ea247140000 > [ 2.539366] kmem_cache info > [ 2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508 > [ 2.539447] debug_vm_pgtable: [init_args ]: args->mm is 0x000000002267cc9e > (...) > [ 2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0 > [ 2.552816] Modules linked in: > [ 2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY > [ 2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries > [ 2.552872] NIP: c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90 > [ 2.552885] REGS: c0000000622e73b0 TRAP: 0700 Not tainted (6.12.0-105.debug_vm2.el10.ppc64le+debug) > [ 2.552899] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 0000000a > [ 2.552954] CFAR: c0000000008f03f0 IRQMASK: 0 > [ 2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001 > [ 2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff > [ 2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000 > [ 2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb > [ 2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0 > [ 2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000 > [ 2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001 > [ 2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760 > [ 2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0 > [ 2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0 > [ 2.553199] Call Trace: > [ 2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable) > [ 2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0 > [ 2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570 > [ 2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650 > [ 2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290 > [ 2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0 > [ 2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870 > [ 2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150 > [ 2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50 > [ 2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0 > [ 2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec > (...) > [ 2.558892] ---[ end trace 0000000000000000 ]--- > [ 2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1 > [ 2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144 > > Here the modprobe process ended up with an allocated mm_struct from the > mm_struct slab that was used before by the debug_vm_pgtable test. That is not a > problem, since the mm_struct is initialized again etc., however, if it ends up > using the same pgd table, it bumps into the old stale entry when clearing/freeing > the page table entries, so it tries to free an entry already gone (that one > which was allocated by the debug_vm_pgtable test), which also explains the > negative pgtables_bytes since it's accounting for not allocated entries in the > current process. As far as I looked pgd_{alloc,free} etc. does not clear entries, > and clearing of the entries is explicitly done in the free_pgtables-> > free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range-> > free_pte_range path. However, the debug_vm_pgtable test does not call > free_pgtables, since it allocates mm_struct and entries manually for its test > and eg. not goes through page faults. So it also should clear manually the > entries before exit at destroy_args(). > > This problem was noticed on a reboot X number of times test being done > on a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE > enabled. Depends on the system, but on a 100 times reboot loop the > problem could manifest once or twice, if a process ends up getting the > right mm->pgd entry with the stale entries used by mm/debug_vm_pagetable. > After using this patch, I couldn't reproduce/experience the problems > anymore. I was able to reproduce the problem as well on latest upstream > kernel (6.16). > > I also modified destroy_args() to use mmput() instead of mmdrop(), there > is no reason to hold mm_users reference and not release the mm_struct > entirely, and in the output above with my debugging prints I already > had patched it to use mmput, it did not fix the problem, but helped > in the debugging as well. > > Signed-off-by: Herton R. Krzesinski <herton@redhat.com> > --- > mm/debug_vm_pgtable.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 7731b238b534..0f5ddefd128a 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -1041,29 +1041,34 @@ static void __init destroy_args(struct pgtable_debug_args *args) > > /* Free page table entries */ > if (args->start_ptep) { > + pmd_clear(args->pmdp); > pte_free(args->mm, args->start_ptep); > mm_dec_nr_ptes(args->mm); > } > > if (args->start_pmdp) { > + pud_clear(args->pudp); > pmd_free(args->mm, args->start_pmdp); > mm_dec_nr_pmds(args->mm); > } > > if (args->start_pudp) { > + p4d_clear(args->p4dp); > pud_free(args->mm, args->start_pudp); > mm_dec_nr_puds(args->mm); > } > > - if (args->start_p4dp) > + if (args->start_p4dp) { > + pgd_clear(args->pgdp); > p4d_free(args->mm, args->start_p4dp); > + } > > /* Free vma and mm struct */ > if (args->vma) > vm_area_free(args->vma); > > if (args->mm) > - mmdrop(args->mm); > + mmput(args->mm); > } > > static struct page * __init
On Thu, 14 Aug 2025 16:16:03 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote: > On 01/08/25 3:10 AM, Herton R. Krzesinski wrote: > > The mm/debug_vm_pagetable test allocates manually page table entries for the > > tests it runs, using also its manually allocated mm_struct. That in itself is > > ok, but when it exits, at destroy_args() it fails to clear those entries with > > the *_clear functions. > > > > The problem is that leaves stale entries. If another process allocates > > an mm_struct with a pgd at the same address, it may end up running into > > the stale entry. This is happening in practice on a debug kernel with > > CONFIG_DEBUG_VM_PGTABLE=y, for example this is the output with some > > extra debugging I added (it prints a warning trace if pgtables_bytes goes > > negative, in addition to the warning at check_mm() function): > > This has been on linux-next for almost last two weeks now and > no problem has been reported. So I guess it's all good. > [top-posting repaired] Thanks, I'll move this into the next batch for sending into mainline.
© 2016 - 2025 Red Hat, Inc.