kernel/bpf/arena.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
apply_range_set_cb() maps the pages for a new arena allocation. It returned
-EBUSY when the target PTE was already populated.
Kernel-fault recovery can leave the per-arena scratch page in an otherwise
unallocated arena PTE: a BPF program that touched an unallocated address
gets the scratch page installed there. A later bpf_arena_alloc_pages()
covering that page then finds the PTE populated, returns -EBUSY, and leaves
the scratch page in place. Every subsequent allocation of that page fails
the same way.
Drop the must-be-empty check so set_pte_at() installs the real page
unconditionally, overwriting a scratch (or empty) PTE.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
---
kernel/bpf/arena.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -131,15 +131,13 @@ static int apply_range_set_cb(pte_t *pte
if (!data)
return 0;
- /* sanity check */
- if (unlikely(!pte_none(ptep_get(pte))))
- return -EBUSY;
page = d->pages[d->i];
/* paranoia, similar to vmap_pages_pte_range() */
if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
return -EINVAL;
+ /* May be none or the scratch page, overwrite either way */
set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
d->i++;
return 0;
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 1727503b25d8..520224dc45c9 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -157,15 +157,13 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
>
> if (!data)
> return 0;
> - /* sanity check */
> - if (unlikely(!pte_none(ptep_get(pte))))
> - return -EBUSY;
>
> page = d->pages[d->i];
> /* paranoia, similar to vmap_pages_pte_range() */
> if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
> return -EINVAL;
>
> + /* May be none or the scratch page, overwrite either way */
> set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
^^^^^^^^^^
When set_pte_at() overwrites a present PTE (the scratch page) with a
different PFN (the new real page), does the arena allocation path flush
the TLB?
The caller arena_alloc_pages() only calls flush_vmap_cache() after
apply_to_page_range(), which expands to flush_cache_vmap() (a no-op on
x86). This flushes data caches but not TLB entries.
Before this patch, apply_range_set_cb() only transitioned not-present to
present, which requires no TLB flush. But now it can transition
present(scratch) to present(real), changing the underlying PFN.
When a BPF program faults on an unallocated arena address, the
kernel-fault recovery path installs the scratch page into that PTE:
bpf_arena_handle_page_fault()
-> apply_range_set_scratch_cb()
-> ptep_try_set()
The faulting CPU then retries the access, loading a TLB entry that maps
the arena kern_vm address to the scratch_page PFN. This entry becomes
stale when a later arena_alloc_pages() overwrites the PTE to point to a
real page.
The arena kern_vm region is mapped in init_mm and shared across all CPUs.
Without a TLB flush, can a CPU holding a stale entry continue
reading/writing the shared scratch page instead of the new real page?
The free path in arena_free_pages() and arena_free_worker() already calls
flush_tlb_kernel_range() after clearing arena PTEs:
arena_free_pages():
ptep_get_and_clear(...)
flush_tlb_kernel_range(start, end);
arena_free_worker():
apply_to_page_range(...)
flush_tlb_kernel_range(...)
Does the alloc path need a similar flush when overwriting a present PTE?
> d->i++;
> return 0;
---
> commit 066af9e73183b42ad5ee27805163ce25053ed9bf
> Author: Tejun Heo <tj@kernel.org>
>
> bpf: Overwrite scratch PTE when allocating arena pages
>
> apply_range_set_cb() maps the pages for a new arena allocation. It returned
> -EBUSY when the target PTE was already populated.
>
> Kernel-fault recovery can leave the per-arena scratch page in an otherwise
> unallocated arena PTE: a BPF program that touched an unallocated address
> gets the scratch page installed there. A later bpf_arena_alloc_pages()
> covering that page then finds the PTE populated, returns -EBUSY, and leaves
> the scratch page in place. Every subsequent allocation of that page fails
> the same way.
>
> Drop the must-be-empty check so set_pte_at() installs the real page
> unconditionally, overwriting a scratch (or empty) PTE.
>
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Hildenbrand <david@kernel.org>
This looks like a bug fix for the scratch page allocation failure.
Should this include:
Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page")
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26718992256
© 2016 - 2026 Red Hat, Inc.