The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
mapping and its pages are installed into userspace with vmf_insert_pfn(),
which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
pte_user_accessible_page() only tests the PRESENT/USER bits and does not
exclude special PTEs, so page_table_check accounts these PFN mappings in
the per-page anon/file map counters even though they are not rmap-managed
pages (vm_normal_page() returns NULL for them).
Most of these data pages live in the kernel image and are never freed, so
the stray accounting is invisible. The time-namespace VVAR page is the
exception: it is a real alloc_page() page that is released with
__free_page() in free_time_ns() when the last task of a time namespace
exits. Across the map / unmap / vdso_join_timens() zap transitions the
special-PTE accounting is not balanced for this page, so a non-zero
file_map_count survives to the free path and trips:
kernel BUG at mm/page_table_check.c:143!
__page_table_check_zero+0xfb/0x130
__free_frozen_pages+0x52f/0x650
free_time_ns+0x85/0xc0
free_nsproxy+0x7f/0x130
do_exit+0x313/0xa60
do_group_exit+0x77/0x90
This is reliably reproducible on x86_64 and arm64 under heavy container/CI
churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
runc / docker-init / tini), and was independently reported by syzbot on
riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
Special PTEs have no struct-page rmap semantics and must never have been
tracked by page table check. Skip them in both the set and clear paths so
the counters stay balanced (always zero) for PFN-mapped pages, regardless
of how the architecture defines pte_user_accessible_page(). pte_special()
is available generically (it is a no-op returning false on architectures
without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
("vdso/datastore: Allocate data pages dynamically") incidentally avoids
the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
with balanced struct-page accounting. This patch fixes the still-affected
VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
page_table_check robust against any future PFN-mapped user pages.
Fixes: df4e817b7108 ("mm: page table check")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reported-by: syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com
Closes: https://github.com/siderolabs/talos/issues/13496
Cc: stable@vger.kernel.org
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
---
mm/page_table_check.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 4eeca782b888..ee492d5389b9 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -150,9 +150,16 @@ void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte)
 	if (&init_mm == mm)
 		return;
 
-	if (pte_user_accessible_page(pte)) {
+	/*
+	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
+	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
+	 * must not be tracked here. Tracking them can leave a non-zero map
+	 * count on a struct page that is later freed (the time namespace VVAR
+	 * page in free_time_ns()), tripping the BUG_ON() in
+	 * __page_table_check_zero().
+	 */
+	if (pte_user_accessible_page(pte) && !pte_special(pte))
 		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
-	}
 }
 EXPORT_SYMBOL(__page_table_check_pte_clear);
 
@@ -205,7 +212,7 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
 
 	for (i = 0; i < nr; i++)
 		__page_table_check_pte_clear(mm, ptep_get(ptep + i));
-	if (pte_user_accessible_page(pte))
+	if (pte_user_accessible_page(pte) && !pte_special(pte))
 		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
 }
 EXPORT_SYMBOL(__page_table_check_ptes_set);
--
2.53.0
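
As a rough illustration of the rule the patch introduces, the userspace
sketch below models a checker that skips special PTEs in both the set and
clear paths. All toy_* names are hypothetical stand-ins, not kernel types or
functions, and the deliberately unbalanced set/clear sequence is an
assumption made for illustration rather than a trace of the real [vvar]
transitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a PTE; NOT the kernel's pte_t. */
struct toy_pte {
	size_t pfn;
	bool user;	/* models pte_user_accessible_page() */
	bool special;	/* models pte_special() */
};

#define TOY_NR_PAGES 4

/* Simplified stand-in for the per-page file map counter. */
static int toy_file_map_count[TOY_NR_PAGES];

/* The rule from the patch: account only non-special user PTEs. */
static bool toy_tracked(struct toy_pte pte)
{
	return pte.user && !pte.special;
}

static void toy_pte_set(struct toy_pte pte)
{
	if (toy_tracked(pte))
		toy_file_map_count[pte.pfn]++;
}

static void toy_pte_clear(struct toy_pte pte)
{
	if (toy_tracked(pte))
		toy_file_map_count[pte.pfn]--;
}

/* Models the check at free time: no tracked mappings may remain. */
static bool toy_page_is_clean(size_t pfn)
{
	return toy_file_map_count[pfn] == 0;
}

static void toy_demo(void)
{
	struct toy_pte vvar = { .pfn = 1, .user = true, .special = true };
	struct toy_pte normal = { .pfn = 2, .user = true, .special = false };

	/* Deliberately unbalanced: two sets, one clear. */
	toy_pte_set(vvar);
	toy_pte_set(vvar);
	toy_pte_clear(vvar);
	assert(toy_page_is_clean(1));	/* special PTEs never touch the counter */

	toy_pte_set(normal);
	assert(!toy_page_is_clean(2));	/* normal pages are still tracked */
	toy_pte_clear(normal);
	assert(toy_page_is_clean(2));
}
```

Calling toy_demo() from a main() exercises both cases. Because the same
predicate guards both paths, the counter for a special mapping stays at zero
however the set and clear calls are interleaved.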
On 6/8/26 17:57, Andrey Smirnov wrote:
> The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> mapping and its pages are installed into userspace with vmf_insert_pfn(),
> which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> exclude special PTEs, so page_table_check accounts these PFN mappings in
> the per-page anon/file map counters even though they are not rmap-managed
> pages (vm_normal_page() returns NULL for them).
>
> Most of these data pages live in the kernel image and are never freed, so
> the stray accounting is invisible. The time-namespace VVAR page is the
> exception: it is a real alloc_page() page that is released with
> __free_page() in free_time_ns() when the last task of a time namespace
> exits. Across the map / unmap / vdso_join_timens() zap transitions the
> special-PTE accounting is not balanced for this page, so a non-zero
> file_map_count survives to the free path and trips:
>
> kernel BUG at mm/page_table_check.c:143!
> __page_table_check_zero+0xfb/0x130
> __free_frozen_pages+0x52f/0x650
> free_time_ns+0x85/0xc0
> free_nsproxy+0x7f/0x130
> do_exit+0x313/0xa60
> do_group_exit+0x77/0x90
>
> This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> runc / docker-init / tini), and was independently reported by syzbot on
> riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
>
> Special PTEs have no struct-page rmap semantics and must never have been
> tracked by page table check. Skip them in both the set and clear paths so
> the counters stay balanced (always zero) for PFN-mapped pages, regardless
> of how the architecture defines pte_user_accessible_page(). pte_special()
> is available generically (it is a no-op returning false on architectures
> without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
Using pte_special() is usually a sign that something is shaky, as it
misses architectures that don't support CONFIG_ARCH_HAS_PTE_SPECIAL.
I assume the relevant architectures (loongarch32?) do not support
CONFIG_PAGE_TABLE_CHECK.
arch/arm64/Kconfig: select ARCH_SUPPORTS_PAGE_TABLE_CHECK
arch/powerpc/Kconfig: select ARCH_SUPPORTS_PAGE_TABLE_CHECK if !HUGETLB_PAGE
arch/riscv/Kconfig: select ARCH_SUPPORTS_PAGE_TABLE_CHECK if MMU
arch/s390/Kconfig: select ARCH_SUPPORTS_PAGE_TABLE_CHECK
arch/x86/Kconfig: select ARCH_SUPPORTS_PAGE_TABLE_CHECK if X86_64
mm/Kconfig.debug: depends on ARCH_SUPPORTS_PAGE_TABLE_CHECK
Can we enforce somehow that we expect CONFIG_ARCH_HAS_PTE_SPECIAL, so anybody
unlocking ARCH_SUPPORTS_PAGE_TABLE_CHECK is aware of this?
For example, through a BUILD_BUG_ON?
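
One possible shape for such a check, sketched in userspace C11. The TOY_*
macros are hypothetical stand-ins for the Kconfig symbols, and
TOY_BUILD_BUG_ON only approximates the kernel's BUILD_BUG_ON() with
_Static_assert; a real kernel patch would presumably test the CONFIG_*
options with IS_ENABLED() instead.

```c
#include <assert.h>

/* Hypothetical stand-ins; in a kernel build these would come from Kconfig. */
#define TOY_CONFIG_PAGE_TABLE_CHECK	1
#define TOY_CONFIG_ARCH_HAS_PTE_SPECIAL	1

/* Userspace analogue of BUILD_BUG_ON(cond): refuse to compile if cond holds. */
#define TOY_BUILD_BUG_ON(cond) \
	_Static_assert(!(cond), "BUILD_BUG_ON failed: " #cond)

/* Hypothetical init hook; the assertion costs nothing at run time. */
static int toy_page_table_check_init(void)
{
	TOY_BUILD_BUG_ON(TOY_CONFIG_PAGE_TABLE_CHECK &&
			 !TOY_CONFIG_ARCH_HAS_PTE_SPECIAL);
	return 0;
}
```

Setting TOY_CONFIG_ARCH_HAS_PTE_SPECIAL to 0 makes the build fail, which is
the property being asked for: enabling the checker on an architecture
without special PTEs would be caught at compile time.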
--
Cheers,
David
On Mon, Jun 08, 2026 at 07:57:58PM +0400, Andrey Smirnov wrote:
> The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> mapping and its pages are installed into userspace with vmf_insert_pfn(),
> which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> exclude special PTEs, so page_table_check accounts these PFN mappings in
> the per-page anon/file map counters even though they are not rmap-managed
> pages (vm_normal_page() returns NULL for them).
>
> Most of these data pages live in the kernel image and are never freed, so
> the stray accounting is invisible. The time-namespace VVAR page is the
> exception: it is a real alloc_page() page that is released with
> __free_page() in free_time_ns() when the last task of a time namespace
> exits. Across the map / unmap / vdso_join_timens() zap transitions the
> special-PTE accounting is not balanced for this page, so a non-zero
> file_map_count survives to the free path and trips:
>
> kernel BUG at mm/page_table_check.c:143!
> __page_table_check_zero+0xfb/0x130
> __free_frozen_pages+0x52f/0x650
> free_time_ns+0x85/0xc0
> free_nsproxy+0x7f/0x130
> do_exit+0x313/0xa60
> do_group_exit+0x77/0x90
>
> This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> runc / docker-init / tini), and was independently reported by syzbot on
> riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
>
> Special PTEs have no struct-page rmap semantics and must never have been
> tracked by page table check. Skip them in both the set and clear paths so
> the counters stay balanced (always zero) for PFN-mapped pages, regardless
> of how the architecture defines pte_user_accessible_page(). pte_special()
> is available generically (it is a no-op returning false on architectures
> without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
>
> Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> with balanced struct-page accounting. This patch fixes the still-affected
> VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> page_table_check robust against any future PFN-mapped user pages.
>
> Fixes: df4e817b7108 ("mm: page table check")
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Reported-by: syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com
> Closes: https://github.com/siderolabs/talos/issues/13496
> Cc: stable@vger.kernel.org
> Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
> ---
> mm/page_table_check.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index 4eeca782b888..ee492d5389b9 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -150,9 +150,16 @@ void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte)
>  	if (&init_mm == mm)
>  		return;
>  
> -	if (pte_user_accessible_page(pte)) {
> +	/*
> +	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
> +	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
> +	 * must not be tracked here. Tracking them can leave a non-zero map
> +	 * count on a struct page that is later freed (the time namespace VVAR
> +	 * page in free_time_ns()), tripping the BUG_ON() in
> +	 * __page_table_check_zero().
Since this comment's mention of the [vvar] pages is already stale, IMO they
should not be named specifically. It is also not clear to me why this only
happens now, or where the non-zero map count comes from.
> +	 */
> +	if (pte_user_accessible_page(pte) && !pte_special(pte))
>  		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
> -	}
>  }
>  EXPORT_SYMBOL(__page_table_check_pte_clear);
>  
> @@ -205,7 +212,7 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>  
>  	for (i = 0; i < nr; i++)
>  		__page_table_check_pte_clear(mm, ptep_get(ptep + i));
> -	if (pte_user_accessible_page(pte))
> +	if (pte_user_accessible_page(pte) && !pte_special(pte))
>  		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
>  }
>  EXPORT_SYMBOL(__page_table_check_ptes_set);
> --
> 2.53.0
>
On Mon, 8 Jun 2026 19:57:58 +0400 Andrey Smirnov <andrey.smirnov@siderolabs.com> wrote:
> The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> mapping and its pages are installed into userspace with vmf_insert_pfn(),
> which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> exclude special PTEs, so page_table_check accounts these PFN mappings in
> the per-page anon/file map counters even though they are not rmap-managed
> pages (vm_normal_page() returns NULL for them).
>
> Most of these data pages live in the kernel image and are never freed, so
> the stray accounting is invisible. The time-namespace VVAR page is the
> exception: it is a real alloc_page() page that is released with
> __free_page() in free_time_ns() when the last task of a time namespace
> exits. Across the map / unmap / vdso_join_timens() zap transitions the
> special-PTE accounting is not balanced for this page, so a non-zero
> file_map_count survives to the free path and trips:
>
> kernel BUG at mm/page_table_check.c:143!
> __page_table_check_zero+0xfb/0x130
> __free_frozen_pages+0x52f/0x650
> free_time_ns+0x85/0xc0
> free_nsproxy+0x7f/0x130
> do_exit+0x313/0xa60
> do_group_exit+0x77/0x90
>
> This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> runc / docker-init / tini), and was independently reported by syzbot on
> riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
>
> Special PTEs have no struct-page rmap semantics and must never have been
> tracked by page table check. Skip them in both the set and clear paths so
> the counters stay balanced (always zero) for PFN-mapped pages, regardless
> of how the architecture defines pte_user_accessible_page(). pte_special()
> is available generically (it is a no-op returning false on architectures
> without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
>
> Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> with balanced struct-page accounting. This patch fixes the still-affected
> VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> page_table_check robust against any future PFN-mapped user pages.
Thanks.
The patch isn't applicable to current -linus mainline. I reworked it
as below, then deleted it. It would be better if this rework came from
yourself (tested), please. And a patch which applies will get checked
by Sashiko AI review.
--- a/mm/page_table_check.c~mm-page_table_check-do-not-track-special-pfn-mapped-ptes
+++ a/mm/page_table_check.c
@@ -151,7 +151,15 @@ void __page_table_check_pte_clear(struct
 	if (&init_mm == mm)
 		return;
 
-	if (pte_user_accessible_page(mm, addr, pte))
+	/*
+	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
+	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
+	 * must not be tracked here. Tracking them can leave a non-zero map
+	 * count on a struct page that is later freed (the time namespace VVAR
+	 * page in free_time_ns()), tripping the BUG_ON() in
+	 * __page_table_check_zero().
+	 */
+	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
 		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(__page_table_check_pte_clear);
@@ -208,7 +216,7 @@ void __page_table_check_ptes_set(struct
 
 	for (i = 0; i < nr; i++)
 		__page_table_check_pte_clear(mm, addr + PAGE_SIZE * i, ptep_get(ptep + i));
-	if (pte_user_accessible_page(mm, addr, pte))
+	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
 		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
 }
 EXPORT_SYMBOL(__page_table_check_ptes_set);
_
On 06-08 14:22, Andrew Morton wrote:
> On Mon, 8 Jun 2026 19:57:58 +0400 Andrey Smirnov <andrey.smirnov@siderolabs.com> wrote:
>
> > The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> > mapping and its pages are installed into userspace with vmf_insert_pfn(),
> > which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> > pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> > exclude special PTEs, so page_table_check accounts these PFN mappings in
> > the per-page anon/file map counters even though they are not rmap-managed
> > pages (vm_normal_page() returns NULL for them).
> >
> > Most of these data pages live in the kernel image and are never freed, so
> > the stray accounting is invisible. The time-namespace VVAR page is the
> > exception: it is a real alloc_page() page that is released with
> > __free_page() in free_time_ns() when the last task of a time namespace
> > exits. Across the map / unmap / vdso_join_timens() zap transitions the
> > special-PTE accounting is not balanced for this page, so a non-zero
> > file_map_count survives to the free path and trips:
> >
> > kernel BUG at mm/page_table_check.c:143!
> > __page_table_check_zero+0xfb/0x130
> > __free_frozen_pages+0x52f/0x650
> > free_time_ns+0x85/0xc0
> > free_nsproxy+0x7f/0x130
> > do_exit+0x313/0xa60
> > do_group_exit+0x77/0x90
> >
> > This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> > churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> > runc / docker-init / tini), and was independently reported by syzbot on
> > riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
> >
> > Special PTEs have no struct-page rmap semantics and must never have been
> > tracked by page table check. Skip them in both the set and clear paths so
> > the counters stay balanced (always zero) for PFN-mapped pages, regardless
> > of how the architecture defines pte_user_accessible_page(). pte_special()
> > is available generically (it is a no-op returning false on architectures
> > without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
> >
> > Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> > ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> > the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> > with balanced struct-page accounting. This patch fixes the still-affected
> > VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> > page_table_check robust against any future PFN-mapped user pages.
Thank you for the detailed explanation of the bug; it makes sense to me.
> Thanks.
>
> The patch isn't applicable to current -linus mainline. I reworked it
> as below, then deleted it. It would be better if this rework came from
> yourself (tested), please. And a patch which applies will get checked
> by Sashiko AI review.
+1.
Pasha
> --- a/mm/page_table_check.c~mm-page_table_check-do-not-track-special-pfn-mapped-ptes
> +++ a/mm/page_table_check.c
> @@ -151,7 +151,15 @@ void __page_table_check_pte_clear(struct
>  	if (&init_mm == mm)
>  		return;
>  
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	/*
> +	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
> +	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
> +	 * must not be tracked here. Tracking them can leave a non-zero map
> +	 * count on a struct page that is later freed (the time namespace VVAR
> +	 * page in free_time_ns()), tripping the BUG_ON() in
> +	 * __page_table_check_zero().
> +	 */
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(__page_table_check_pte_clear);
> @@ -208,7 +216,7 @@ void __page_table_check_ptes_set(struct
>  
>  	for (i = 0; i < nr; i++)
>  		__page_table_check_pte_clear(mm, addr + PAGE_SIZE * i, ptep_get(ptep + i));
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
>  }
>  EXPORT_SYMBOL(__page_table_check_ptes_set);
> _
>