[PATCH bpf-next] bpf: Overwrite scratch PTE when allocating arena pages

Tejun Heo posted 1 patch 1 week ago
kernel/bpf/arena.c |    4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
[PATCH bpf-next] bpf: Overwrite scratch PTE when allocating arena pages
Posted by Tejun Heo 1 week ago
apply_range_set_cb() maps the pages for a new arena allocation. It returned
-EBUSY when the target PTE was already populated.

Kernel-fault recovery can leave the per-arena scratch page in an otherwise
unallocated arena PTE: a BPF program that touched an unallocated address
gets the scratch page installed there. A later bpf_arena_alloc_pages()
covering that page then finds the PTE populated, returns -EBUSY, and leaves
the scratch page in place. Every subsequent allocation of that page fails
the same way.

Drop the must-be-empty check so set_pte_at() installs the real page
unconditionally, overwriting a scratch (or empty) PTE.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
---
 kernel/bpf/arena.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -131,15 +131,13 @@ static int apply_range_set_cb(pte_t *pte
 
 	if (!data)
 		return 0;
-	/* sanity check */
-	if (unlikely(!pte_none(ptep_get(pte))))
-		return -EBUSY;
 
 	page = d->pages[d->i];
 	/* paranoia, similar to vmap_pages_pte_range() */
 	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
 		return -EINVAL;
 
+	/* May be none or the scratch page, overwrite either way */
 	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
 	d->i++;
 	return 0;
Re: [PATCH bpf-next] bpf: Overwrite scratch PTE when allocating arena pages
Posted by bot+bpf-ci@kernel.org 1 week ago
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 1727503b25d8..520224dc45c9 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -157,15 +157,13 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
>
>  	if (!data)
>  		return 0;
> -	/* sanity check */
> -	if (unlikely(!pte_none(ptep_get(pte))))
> -		return -EBUSY;
>
>  	page = d->pages[d->i];
>  	/* paranoia, similar to vmap_pages_pte_range() */
>  	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
>  		return -EINVAL;
>
> +	/* May be none or the scratch page, overwrite either way */
>  	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
                ^^^^^^^^^^

When set_pte_at() overwrites a present PTE (the scratch page) with a
different PFN (the new real page), does the arena allocation path flush
the TLB?

The caller arena_alloc_pages() only calls flush_vmap_cache() after
apply_to_page_range(), which expands to flush_cache_vmap() (a no-op on
x86). This flushes data caches but not TLB entries.

Before this patch, apply_range_set_cb() only transitioned not-present to
present, which requires no TLB flush. But now it can transition
present(scratch) to present(real), changing the underlying PFN.

When a BPF program faults on an unallocated arena address, the
kernel-fault recovery path installs the scratch page into that PTE:

bpf_arena_handle_page_fault()
  -> apply_range_set_scratch_cb()
     -> ptep_try_set()

The faulting CPU then retries the access, loading a TLB entry that maps
the arena kern_vm address to the scratch_page PFN. This entry becomes
stale when a later arena_alloc_pages() overwrites the PTE to point to a
real page.

The arena kern_vm region is mapped in init_mm and shared across all CPUs.
Without a TLB flush, can a CPU holding a stale entry continue
reading/writing the shared scratch page instead of the new real page?

The free path in arena_free_pages() and arena_free_worker() already calls
flush_tlb_kernel_range() after clearing arena PTEs:

arena_free_pages():
    ptep_get_and_clear(...)
    flush_tlb_kernel_range(start, end);

arena_free_worker():
    apply_to_page_range(...)
    flush_tlb_kernel_range(...)

Does the alloc path need a similar flush when overwriting a present PTE?


>  	d->i++;
>  	return 0;

---

> commit 066af9e73183b42ad5ee27805163ce25053ed9bf
> Author: Tejun Heo <tj@kernel.org>
>
>     bpf: Overwrite scratch PTE when allocating arena pages
>
>     apply_range_set_cb() maps the pages for a new arena allocation. It returned
>     -EBUSY when the target PTE was already populated.
>
>     Kernel-fault recovery can leave the per-arena scratch page in an otherwise
>     unallocated arena PTE: a BPF program that touched an unallocated address
>     gets the scratch page installed there. A later bpf_arena_alloc_pages()
>     covering that page then finds the PTE populated, returns -EBUSY, and leaves
>     the scratch page in place. Every subsequent allocation of that page fails
>     the same way.
>
>     Drop the must-be-empty check so set_pte_at() installs the real page
>     unconditionally, overwriting a scratch (or empty) PTE.
>
>     Suggested-by: Alexei Starovoitov <ast@kernel.org>
>     Signed-off-by: Tejun Heo <tj@kernel.org>
>     Cc: David Hildenbrand <david@kernel.org>

This looks like a bug fix for the scratch page allocation failure.
Should this include:

  Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26718992256