v2 - this is an attempt to address David Hildenbrand's comments:
overloading GFP and using page->private, support for
balloon deflate.
I hope this one is acceptable, API wise.
I also went ahead and implemented an alternative approach
that David suggested:
using GFP_ZERO to zero userspace pages.
The issue is simple: on some architectures, one has to know the
userspace fault address in order to flush the cache.
So, I had to propagate the fault address everywhere.
A lot of churn, and my concern is, if we miss even one
place, silent, subtle data corruption will result and only
on some arches (x86 will be fine).
Still, you can view that approach here:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero
David, if you still feel I should switch to that approach,
let me know. Personally, I'd rather keep that as a separate
project from this optimization.
Still an RFC as virtio bits need work, but I would very much like
to get a general agreement on mm bits first. Thanks!
Patch 1 is a minor
optimization that I am carrying here to avoid conflicts. It
might make sense to merge it straight away.
-------
When a guest reports free pages to the hypervisor via virtio-balloon's
free page reporting, the host typically zeros those pages when reclaiming
their backing memory (e.g., via MADV_DONTNEED on anonymous mappings).
When the guest later reallocates those pages, the kernel zeros them
again -- redundantly.
This series eliminates that double-zeroing by propagating the "host
already zeroed this page" information through the buddy allocator and
into the page fault path.
Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
256MB of anonymous pages:
metric         baseline          optimized         delta
task-clock     191 +- 31 ms      60 +- 35 ms       -68%
cache-misses   1.10M +- 460K     269K +- 31K       -76%
instructions   4.54M +- 275K     4.10M +- 130K     -10%
With hugetlb surplus pages:
metric         baseline          optimized         delta
task-clock     183 +- 24 ms      45 +- 23 ms       -76%
cache-misses   1.27M +- 544K     270K +- 16K       -79%
instructions   5.37M +- 254K     4.94M +- 155K     -8%
Notes:
- The virtio_balloon module parameter (15/18) is a testing hack.
A proper virtio feature flag is needed before merging.
- Patch 16/18 adds a sysfs flush trigger for deterministic testing
(avoids waiting for the 2-second reporting delay).
- When host_zeroes_pages is set, callers skip folio_zero_user() for
pages known to be zeroed by the host. This is safe on all
architectures because the hypervisor invalidates guest cache lines
when reclaiming page backing (MADV_DONTNEED).
- PG_zeroed is aliased to PG_private. It is excluded from
PAGE_FLAGS_CHECK_AT_PREP because it must survive on free-list pages
until post_alloc_hook() consumes and clears it. Is this acceptable,
or should a different bit be used?
- The optimization is most effective with THP, where entire 2MB
pages are allocated directly from reported order-9+ buddy pages.
Without THP, only ~21% of order-0 allocations come from reported
pages due to low-order fragmentation.
- Persistent hugetlb pool pages are not covered: when freed by
userspace they return to the hugetlb free pool, not the buddy
allocator, so they are never reported to the host. Surplus
hugetlb pages are allocated from buddy and do benefit.
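To make the flag lifetime described in these notes concrete, here is a toy userspace model in C. It is not kernel code: toy_page, toy_report_to_host(), toy_merge_zeroed() and toy_post_alloc() are hypothetical names, and the merge rule (a merged block counts as zeroed only if both buddies were) is an assumption, since the cover letter only says that PG_zeroed is tracked across merges and splits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page: only what this sketch needs. */
struct toy_page {
	bool zeroed;				/* models PG_zeroed */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Reporting path: mark the page only if the host is known to zero it. */
static void toy_report_to_host(struct toy_page *p, bool host_zeroes_pages)
{
	assert(p != NULL);
	if (host_zeroes_pages) {
		/* Models the host dropping the backing and refaulting zeros. */
		memset(p->data, 0, sizeof(p->data));
		p->zeroed = true;
	} else {
		/* Contents unknown: allocation must keep zeroing. */
		p->zeroed = false;
	}
}

/* Assumed merge rule: the result is known-zero only if both halves are. */
static bool toy_merge_zeroed(bool a, bool b)
{
	return a && b;
}

/*
 * Allocation hook: the single consume-and-clear point.
 * Returns true if it had to zero the page itself.
 */
static bool toy_post_alloc(struct toy_page *p, bool want_zero)
{
	bool was_zeroed;

	assert(p != NULL);
	was_zeroed = p->zeroed;
	p->zeroed = false;	/* the flag never leaves the allocator */

	if (want_zero && !was_zeroed) {
		memset(p->data, 0, sizeof(p->data));
		return true;
	}
	return false;
}
```

In this model a page that was reported with host_zeroes_pages set is handed out without a second memset(), while a page freed by a user and allocated again is zeroed as usual because the flag was cleared on the previous allocation.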
Test program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
int main(int argc, char **argv)
{
	unsigned long size;
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *p;
	int r;
	if (argc < 2) {
		fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
		return 1;
	}
	size = atol(argv[1]) * 1024UL * 1024;
	if (argc >= 3 && strcmp(argv[2], "huge") == 0)
		flags |= MAP_HUGETLB;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	r = madvise(p, size, MADV_POPULATE_WRITE);
	if (r) {
		perror("madvise");
		return 1;
	}
	munmap(p, size);
	return 0;
}
Test script (bench.sh):
#!/bin/bash
# Usage: bench.sh <size_mb> <mode> <iterations> [huge]
# mode 0 = baseline, mode 1 = skip zeroing
SZ=${1:-256}; MODE=${2:-0}; ITER=${3:-10}; HUGE=${4:-}
FLUSH=/sys/module/page_reporting/parameters/flush
PERF_DATA=/tmp/perf-$MODE.csv
rmmod virtio_balloon 2>/dev/null
insmod virtio_balloon.ko host_zeroes_pages=$MODE
echo 512 > $FLUSH
[ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
rm -f $PERF_DATA
echo "=== sz=${SZ}MB mode=$MODE iter=$ITER $HUGE ==="
for i in $(seq 1 $ITER); do
	echo 3 > /proc/sys/vm/drop_caches
	echo 512 > $FLUSH
	perf stat -e task-clock,instructions,cache-misses \
		-x, -o $PERF_DATA --append -- ./alloc_once $SZ $HUGE
done
[ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
rmmod virtio_balloon
awk -F, '/^#/||/^$/{next}{v=$1+0;e=$3;gsub(/ /,"",e);s[e]+=v;n[e]++}
END{for(e in s)printf " %-16s %10.2f (n=%d)\n",e,s[e]/n[e],n[e]}' $PERF_DATA
Compile and run:
gcc -static -O2 -o alloc_once alloc_once.c
bash bench.sh 256 0 10 # baseline (regular pages)
bash bench.sh 256 1 10 # optimized (regular pages)
bash bench.sh 256 0 10 huge # baseline (hugetlb surplus)
bash bench.sh 256 1 10 huge # optimized (hugetlb surplus)
Changes since v1:
- Replaced __GFP_PREZEROED with PG_zeroed page flag (aliased PG_private)
- Added pghint_t type and vma_alloc_folio_hints() API
- Track PG_zeroed across buddy merges and splits
- Added post_alloc_hook integration (single consume/clear point)
- Added hugetlb support (pool pages + memfd)
- Added page_reporting flush parameter for deterministic testing
- Added free_frozen_pages_hint/put_page_hint for balloon deflate path
- Added try_to_claim_block PG_zeroed preservation
- Updated perf numbers with per-iteration flush methodology
Michael S. Tsirkin (18):
mm: page_alloc: propagate PageReported flag across buddy splits
mm: add pghint_t type and vma_alloc_folio_hints API
mm: add PG_zeroed page flag for known-zero pages
mm: page_alloc: track PG_zeroed across buddy merges
mm: page_alloc: preserve PG_zeroed in try_to_claim_block
mm: page_alloc: thread pghint_t through get_page_from_freelist
mm: post_alloc_hook: use PG_zeroed to skip zeroing, return pghint_t
mm: hugetlb: thread pghint_t through buddy allocation chain
mm: hugetlb: use PG_zeroed for pool pages, skip redundant zeroing
mm: page_reporting: support host-zeroed reported pages
mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed
pages
mm: skip zeroing in alloc_anon_folio for pre-zeroed pages
mm: skip zeroing in vma_alloc_anon_folio_pmd for pre-zeroed pages
mm: memfd: skip zeroing for pre-zeroed hugetlb pages
virtio_balloon: add host_zeroes_pages module parameter
mm: page_reporting: add flush parameter with page budget
mm: add free_frozen_pages_hint and put_page_hint APIs
virtio_balloon: mark deflated pages as pre-zeroed
drivers/virtio/virtio_balloon.c | 11 ++-
fs/hugetlbfs/inode.c | 5 +-
include/linux/gfp.h | 17 +++++
include/linux/highmem.h | 6 +-
include/linux/hugetlb.h | 6 +-
include/linux/mm.h | 12 +++
include/linux/page-flags.h | 13 +++-
include/linux/page_reporting.h | 3 +
mm/compaction.c | 4 +-
mm/huge_memory.c | 12 +--
mm/hugetlb.c | 52 +++++++++----
mm/internal.h | 7 +-
mm/memfd.c | 12 +--
mm/memory.c | 14 ++--
mm/mempolicy.c | 85 +++++++++++++++++++++
mm/page_alloc.c | 131 ++++++++++++++++++++++++--------
mm/page_reporting.c | 55 +++++++++++++-
mm/page_reporting.h | 11 +++
mm/swap.c | 19 +++++
19 files changed, 392 insertions(+), 83 deletions(-)
--
MST
On Mon, Apr 20, 2026 at 08:51:13AM -0400, Michael S. Tsirkin wrote:
>
> When a guest reports free pages to the hypervisor via virtio-balloon's
> free page reporting, the host typically zeros those pages when reclaiming
> their backing memory (e.g., via MADV_DONTNEED on anonymous mappings).
> When the guest later reallocates those pages, the kernel zeros them
> again -- redundantly.
>
It took me a second to really wrap my head around what you were saying
here, but if i'm following correctly:
1) Guest steals a page, reports the free page to the host
2) Host returns that page to the buddy
3) Guest wants the page back -> vmexit, alloc()
a) host gets a page from the buddy via fault path
b) this memory is "user memory" so host zeroes the page
4) Guest repeats step 3, re-zeroing the page
So you're adding a step that does:
1) page_reporting_drain() in guest sets PG_zeroed if host_zeroes_pages=true
2) on allocation, if PG_zeroed is set, don't zero
In theory this seems ok. PG_zeroed being a buddy-only flag is nice.
In practice there are obvious concerns about an explicit flag that would
allow a kernel (in this case the guest) to skip zeroing a page destined
for userland mappings - but i'm also paranoid.
In concept this seems reasonable, in implementation I have concerns
about the pghint_t type being added. Will respond inline in David's
reply thread on that though where you already have notes.
~Gregory
On 4/20/26 14:51, Michael S. Tsirkin wrote:
> Hi!
>
> v2 - this is an attempt to address David Hildenbrand's comments:
> overloading GFP and using page->private, support for
> balloon deflate.
>
> I hope this one is acceptable, API wise.
>
> I also went ahead and implemented an alternative approach
> that David suggested:
> using GFP_ZERO to zero userspace pages.
> The issue is simple: on some architectures, one has to know the
> userspace fault address in order to flush the cache.
>
> So, I had to propagate the fault address everywhere.
As I said, that might not be necessary. vma_alloc_folio() is the
interface we mostly care about in that regard.
> A lot of churn, and my concern is, if we miss even one
> place, silent, subtle data corruption will result and only
> on some arches (x86 will be fine).
Which would *already* be the case of you use folio_alloc(GFP_ZERO)
instead of magical vma_alloc_folio() + folio_zero_user().
I don't really see how vma_alloc_folio_hints() -- that also consumes the
address -- is any better in that regard?
When we just do the right thing with vma_alloc_folio(GFP_ZERO), at least
vma_alloc_folio() users will not accidentally do the wrong thing by
forgetting to use folio_zero_user().
>
> Still, you can view that approach here:
> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero
>
> David, if you still feel I should switch to that approach,
> let me know. Personally, I'd rather keep that as a separate
> project from this optimization.
I'd prefer if we extend vma_alloc_folio() to just handle GFP_ZERO for us.
But let's hear other opinions first.
--
Cheers,
David
On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> On 4/20/26 14:51, Michael S. Tsirkin wrote:
> >
>
> Hi!
>
> >
> > v2 - this is an attempt to address David Hildenbrand's comments:
> > overloading GFP and using page->private, support for
> > balloon deflate.
> >
> > I hope this one is acceptable, API wise.
> >
> > I also went ahead and implemented an alternative approach
> > that David suggested:
> > using GFP_ZERO to zero userspace pages.
> > The issue is simple: on some architectures, one has to know the
> > userspace fault address in order to flush the cache.
> >
> > So, I had to propagate the fault address everywhere.
>
> As I said, that might not be necessary. vma_alloc_folio() is the
> interface we mostly care about in that regard.
>
I'm not sure I follow what "might not be necessary". We need a fault
address so zeroing can be effective wrt cache. Since you asked that it's
done deep in post alloc hook, the address has to propagate all over mm.
> > A lot of churn, and my concern is, if we miss even one
> > place, silent, subtle data corruption will result and only
> > on some arches (x86 will be fine).
>
> Which would *already* be the case of you use folio_alloc(GFP_ZERO)
> instead of magical vma_alloc_folio() + folio_zero_user().
>
> I don't really see how vma_alloc_folio_hints() -- that also consumes the
> address -- is any better in that regard?
By itself, it is not. But the issue is propagating the address from
there all over mm. If we miss even one place - we get a subtle cache
corruption on non x86.
hints are exactly that - if we forget to set them, all that happens
is that we do an extra zeroing. That is all.
> When we just do the right thing with vma_alloc_folio(GFP_ZERO), at least
> vma_alloc_folio() users will not accidentally do the wrong thing by
> forgetting to use folio_zero_user().
Well, it's simply that
1. if you plain forget folio_zero_user you get non zero on all arches
2. we *already* have folio_zero_user in place
> >
> > Still, you can view that approach here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero
> >
> > David, if you still feel I should switch to that approach,
> > let me know. Personally, I'd rather keep that as a separate
> > project from this optimization.
> I'd prefer if we extend vma_alloc_folio() to just handle GFP_ZERO for us.
Pls take a look at that tree then. What do you think of that approach?
Better?
If you want it in form of patches, I can post them in private or on
list. Let me know, I don't have a problem with that approach - I tested
it and the performance is the same. But the issue is that there's lot of
paths that have to propagate the fault address. It took me a while to
even find them all (assuming I found them all).
I also note that we need a flag for free in order to implement
balloon deflate as you asked. Here, I reused the hints.
> But let's hear other opinions first.
>
> --
> Cheers,
>
> David
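The fail-safe property claimed for hints in this reply can be illustrated with a small userspace sketch in C. Everything here is hypothetical: toy_pghint_t, TOY_HINT_ZEROED, toy_alloc_hints() and toy_zero_unless_hinted() are stand-ins invented for illustration, and the real pghint_t layout is not shown in this thread. The only point is that a caller which loses or ignores the hint falls back to zeroing, so the cost of a mistake is an extra memset() rather than stale data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hint type; not the series' actual pghint_t definition. */
typedef unsigned int toy_pghint_t;
#define TOY_HINT_NONE	0u
#define TOY_HINT_ZEROED	1u

/*
 * Allocator side: report TOY_HINT_ZEROED only when the memory is known
 * to be zero. calloc() stands in for a pre-zeroed page here.
 */
static void *toy_alloc_hints(size_t size, bool known_zero, toy_pghint_t *hints)
{
	void *p = known_zero ? calloc(1, size) : malloc(size);

	if (hints)
		*hints = (p && known_zero) ? TOY_HINT_ZEROED : TOY_HINT_NONE;
	return p;
}

/*
 * Caller side: zero unless the hint says it is unnecessary.
 * Returns true if it actually zeroed the buffer.
 */
static bool toy_zero_unless_hinted(void *p, size_t size, toy_pghint_t hints)
{
	assert(p != NULL);
	if (hints & TOY_HINT_ZEROED)
		return false;
	memset(p, 0, size);
	return true;
}
```

A caller that passes TOY_HINT_NONE for an already zeroed buffer (the "forgot to propagate the hint" case) simply zeroes again; nothing in this model can make it skip zeroing memory that was not zeroed.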
On 4/21/26 01:33, Michael S. Tsirkin wrote:
> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/20/26 14:51, Michael S. Tsirkin wrote:
>>>
>>
>> Hi!
>>
>>>
>>> v2 - this is an attempt to address David Hildenbrand's comments:
>>> overloading GFP and using page->private, support for
>>> balloon deflate.
>>>
>>> I hope this one is acceptable, API wise.
>>>
>>> I also went ahead and implemented an alternative approach
>>> that David suggested:
>>> using GFP_ZERO to zero userspace pages.
>>> The issue is simple: on some architectures, one has to know the
>>> userspace fault address in order to flush the cache.
>>>
>>> So, I had to propagate the fault address everywhere.
>>
>> As I said, that might not be necessary. vma_alloc_folio() is the
>> interface we mostly care about in that regard.
>>
>
> I'm not sure I follow what "might not be necessary". We need a fault
> address so zeroing can be effective wrt cache. Since you asked that it's
> done deep in post alloc hook, the address has to propagate all over mm.
Let's look at who ends up using user_alloc_needs_zeroing() or
folio_zero_user()
3 folio_zero_user() users are hugetlb that might get pages from another
allocator. In particular, mm/memfd.c even passes 0 as it doesn't even
have an address.
I don't think we particularly care about speeding up hugetlb zeroing at
this point when we already don't even care about optimizing for
user_alloc_needs_zeroing(). But it could be reworked later to optimize
zeroing in a similar way when actually allocating a folio from the buddy.
Now, for callers we care more about:
* vma_alloc_anon_folio_pmd() calls
vma_alloc_folio()+user_alloc_needs_zeroing()+folio_zero_user()
* alloc_anon_folio() calls
vma_alloc_folio()+user_alloc_needs_zeroing()+folio_zero_user()
* vma_alloc_zeroed_movable_folio() calls
vma_alloc_folio()+user_alloc_needs_zeroing()+
clear_user_highpage().
Other vma_alloc_folio() users neither specify __GFP_ZERO nor use
folio_zero_user(), as they will be overwriting the data either way. Like
KSM when unsharing, for example.
I am saying we move "user_alloc_needs_zeroing()+folio_zero_user()" into
vma_alloc_folio(), by teaching vma_alloc_folio() to respect __GFP_ZERO.
user_alloc_needs_zeroing() will effectively go away as the buddy will
just handle that.
All of the above is what you do on gfp_zero branch already, so I think
you understood what I meant regarding this interface.
Anybody in the tree that would be using another folio_alloc() (or page
allocator) interface with __GFP_ZERO *would already be broken on other
architectures* where we would actually require folio_zero_user(), as
they would already not be using folio_zero_user().
But we don't really need the user address in many cases, like when
allocating a folio for the pagecache where we don't even have an address.
>
>
>>> A lot of churn, and my concern is, if we miss even one
>>> place, silent, subtle data corruption will result and only
>>> on some arches (x86 will be fine).
>>
>> Which would *already* be the case of you use folio_alloc(GFP_ZERO)
>> instead of magical vma_alloc_folio() + folio_zero_user().
>>
>> I don't really see how vma_alloc_folio_hints() -- that also consumes the
>> address -- is any better in that regard?
>
> By itself, it is not. But the issue is propagating the address from
> there all over mm. If we miss even one place - we get a subtle cache
> corruption on non x86.
Yes. Like someone not using folio_zero_user() as of today.
[...]
>
>>>
>>> Still, you can view that approach here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero
>>>
>>> David, if you still feel I should switch to that approach,
>>> let me know. Personally, I'd rather keep that as a separate
>>> project from this optimization.
>> I'd prefer if we extend vma_alloc_folio() to just handle GFP_ZERO for us.
>
> Pls take a look at that tree then. What do you think of that approach?
> Better?
I primarily wonder whether we can limit the impact in patch #1 by
focusing on the vma_alloc_folio() path only.
For example, I don't think converting all folio_alloc_mpol() users to
consume USER_ADDR_NONE at this point is really reasonable.
(a) Focus on vma_alloc_folio(), where we already have an address.
(b) To implement vma_alloc_folio() that way, we might need some internal
interfaces that consume an address.
For example, instead of changing all callers of post_alloc_hook() to
pass USER_ADDR_NONE, can we make post_alloc_hook() a simple wrapper
around a variant that consumes an address?
So isn't there a way we can just keep the changes mostly to mm/page_alloc.c?
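The wrapper idea in (b) can be sketched in plain C as follows. This is a userspace model under stated assumptions, not the kernel's post_alloc_hook(): the toy_ prefixed names, the TOY_GFP_ZERO bit, the sentinel value chosen for TOY_USER_ADDR_NONE and the toy_stats counters are all hypothetical, and the place where a real implementation would do architecture-specific cache maintenance is only marked with a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_GFP_ZERO		0x1u
/* Hypothetical sentinel meaning "no user mapping address is known". */
#define TOY_USER_ADDR_NONE	(~0UL)

/* Counters so the two zeroing paths can be told apart in this model. */
struct toy_stats {
	unsigned int plain_zero;
	unsigned int user_zero;
};

/* Address-consuming variant: the one place that decides how to zero. */
static void toy_post_alloc_hook_addr(void *page, size_t size, unsigned int gfp,
				     unsigned long user_addr,
				     struct toy_stats *st)
{
	assert(page != NULL && st != NULL);
	if (!(gfp & TOY_GFP_ZERO))
		return;

	memset(page, 0, size);
	if (user_addr == TOY_USER_ADDR_NONE) {
		st->plain_zero++;
	} else {
		/* A real implementation would flush for user_addr here. */
		st->user_zero++;
	}
}

/* Existing callers keep this signature, so they need no changes. */
static void toy_post_alloc_hook(void *page, size_t size, unsigned int gfp,
				struct toy_stats *st)
{
	toy_post_alloc_hook_addr(page, size, gfp, TOY_USER_ADDR_NONE, st);
}
```

In this shape only the path that already has a fault address (the vma_alloc_folio() analogue) would call the _addr variant directly; every other caller goes through the unchanged wrapper.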
>
> I also note that we need a flag for free in order to implement
> balloon deflate as you asked. Here, I reused the hints.
Yes, but on the allocation path we do have a flag: __GFP_ZERO.
--
Cheers,
David
On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> > On 4/20/26 14:51, Michael S. Tsirkin wrote:
>
> > > A lot of churn, and my concern is, if we miss even one
> > > place, silent, subtle data corruption will result and only
> > > on some arches (x86 will be fine).
> >
> > Which would *already* be the case of you use folio_alloc(GFP_ZERO)
> > instead of magical vma_alloc_folio() + folio_zero_user().
> >
> > I don't really see how vma_alloc_folio_hints() -- that also consumes the
> > address -- is any better in that regard?
>
> By itself, it is not. But the issue is propagating the address from
> there all over mm. If we miss even one place - we get a subtle cache
> corruption on non x86.
>
Why does it need to propagate?
Can we leave folio_zero_user() callers the same, but add a PG_zeroed
check in folio_zero_user() that skips the zeroing (but not the cache
flush) and clear the PG_zeroed bit?
Is this feasible?
You don't eliminate the folio_zero_user(), but maybe we shouldn't?
(a bit naive here - i haven't checked the PG_zeroed lifetime, i did
see it overloads PG_private - so this might not be feasible)
>
> I also note that we need a flag for free in order to implement
> balloon deflate as you asked. Here, I reused the hints.
>
I'd sooner just implement this as
___put_folio(folio, gfp_t)
__put_folio(folio) { ___put_folio(folio, NULL); }
And change the free path to take overloaded gfp flags.
Some of the existing ones might even be useful as-is.
It's essentially the same thing, but prevents a bunch of churn and
saves us a new concept.
optional gfp flags on free seem like genuinely useful interface for
certain callers (definitely not all).
~Gregory
On Mon, Apr 20, 2026 at 10:38:19PM -0400, Gregory Price wrote:
> On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> > > On 4/20/26 14:51, Michael S. Tsirkin wrote:
> >
> > > > A lot of churn, and my concern is, if we miss even one
> > > > place, silent, subtle data corruption will result and only
> > > > on some arches (x86 will be fine).
> > >
> > > Which would *already* be the case of you use folio_alloc(GFP_ZERO)
> > > instead of magical vma_alloc_folio() + folio_zero_user().
> > >
> > > I don't really see how vma_alloc_folio_hints() -- that also consumes the
> > > address -- is any better in that regard?
> >
> > By itself, it is not. But the issue is propagating the address from
> > there all over mm. If we miss even one place - we get a subtle cache
> > corruption on non x86.
> >
>
> Why does it need to propogate?
>
> Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> check in folio_zero_user() that skips the zeroing (but not the cache
> flush) and clear the PG_zeroed bit?
>
> Is this feasible?
I do not see how - this would require leaking the page flag out of the
buddy allocator.
> You don't eliminate the folio_zero_user(), but maybe we shouldn't?
>
> (a bit naive here - i haven't checked the PG_zeroed lifetime, i did
> see it overloads PG_private - so this might not be feasible)
>
> >
> > I also note that we need a flag for free in order to implement
> > balloon deflate as you asked. Here, I reused the hints.
> >
>
> I'd sooner just implement this as
>
> ___put_folio(folio, gfp_t)
>
> __put_folio(folio) { ___put_folio(folio, NULL); }
>
> And change the free path to take overloaded gfp flags.
> Some of the existing ones might even be useful as-is.
>
> It's essentially the same thing, but prevents a bunch of churn and
> saves us a new concept.
>
> optional gfp flags on free seem like genuinely useful interface for
> certain callers (definitely not all).
>
> ~Gregory
But we do not have a gfp_t flag meaning "this has been zeroed"
and when I proposed something similar in v1, David hated abusing
gfp flags for what is not an allocation property.
On Tue, Apr 21, 2026 at 09:06:00AM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 20, 2026 at 10:38:19PM -0400, Gregory Price wrote:
> >
> > Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> > check in folio_zero_user() that skips the zeroing (but not the cache
> > flush) and clear the PG_zeroed bit?
> >
> > Is this feasible?
>
> I do not see how - this would require leaking the page flag out of the
> buddy allocator.
>
Right, but you're leaking that bit of information out one way or another
- whether it's a page-flag or something else (pghint_t) you have the
same lifecycle problems (when does it become invalidated? how long can
it be trusted for?).
I suppose at least with (pghint_t) the data (in theory) falls out of
scope and doesn't live with the page - but guaranteed it just ends up
polluting more and more interfaces.
I'm seeing why David's suggest to plumb __GFP_ZERO correctly makes
sense, it's really the only feasible approach here that doesn't generate
a staleness problem with whatever information you try to leak out.
~Gregory
On Tue, Apr 21, 2026 at 12:51:00PM -0400, Gregory Price wrote:
> On Tue, Apr 21, 2026 at 09:06:00AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Apr 20, 2026 at 10:38:19PM -0400, Gregory Price wrote:
> > >
> > > Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> > > check in folio_zero_user() that skips the zeroing (but not the cache
> > > flush) and clear the PG_zeroed bit?
> > >
> > > Is this feasible?
> >
> > I do not see how - this would require leaking the page flag out of the
> > buddy allocator.
> >
>
> Right, but you're leaking that bit of information out one way or another
> - whether it's a page-flag or something else (pghint_t) you have the
> same lifecycle problems (when does it become invalidated? how long can
> it be trusted for?).
>
> I suppose at least with (pghint_t) the data (in theory) falls out of
> scope and doesn't live with the page - but guaranteed it just ends up
> polluting more and more interfaces.
>
> I'm seeing why David's suggest to plumb __GFP_ZERO correctly makes
> sense, it's really the only feasible approach here that doesn't generate
> a staleness problem with whatever information you try to leak out.
>
> ~Gregory
OK, v3 with that incoming.
On 4/21/26 04:38, Gregory Price wrote:
> On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
>> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
>>
>>>
>>> Which would *already* be the case of you use folio_alloc(GFP_ZERO)
>>> instead of magical vma_alloc_folio() + folio_zero_user().
>>>
>>> I don't really see how vma_alloc_folio_hints() -- that also consumes the
>>> address -- is any better in that regard?
>>
>> By itself, it is not. But the issue is propagating the address from
>> there all over mm. If we miss even one place - we get a subtle cache
>> corruption on non x86.
>>
>
> Why does it need to propogate?
>
> Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> check in folio_zero_user() that skips the zeroing (but not the cache
> flush) and clear the PG_zeroed bit?
folio_zero_user() is just an abomination, really.
--
Cheers,
David
On Tue, Apr 21, 2026 at 12:04:49PM +0200, David Hildenbrand (Arm) wrote:
> On 4/21/26 04:38, Gregory Price wrote:
> > On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
> >> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> >>
> >>>
> >>> Which would *already* be the case of you use folio_alloc(GFP_ZERO)
> >>> instead of magical vma_alloc_folio() + folio_zero_user().
> >>>
> >>> I don't really see how vma_alloc_folio_hints() -- that also consumes the
> >>> address -- is any better in that regard?
> >>
> >> By itself, it is not. But the issue is propagating the address from
> >> there all over mm. If we miss even one place - we get a subtle cache
> >> corruption on non x86.
> >>
> >
> > Why does it need to propogate?
> >
> > Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> > check in folio_zero_user() that skips the zeroing (but not the cache
> > flush) and clear the PG_zeroed bit?
>
> folio_zero_user() is just an abomination, really.
We can't completely replace it with GFP_ZERO though e.g. because hugetlbfs
has its own pool and needs to zero that.
> --
> Cheers,
>
> David
On 4/21/26 16:15, Michael S. Tsirkin wrote:
> On Tue, Apr 21, 2026 at 12:04:49PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/21/26 04:38, Gregory Price wrote:
>>>
>>> Why does it need to propogate?
>>>
>>> Can we leave folio_zero_user() callers the same, but add a PG_zeroed
>>> check in folio_zero_user() that skips the zeroing (but not the cache
>>> flush) and clear the PG_zeroed bit?
>>
>> folio_zero_user() is just an abomination, really.
>
> We can't completely replace it with GFP_ZERO though e.g. because hugetlbfs
> has its own pool and needs to zero that.
Right, hugetlb will have to keep using it for now.
--
Cheers,
David
syzbot ci has tested the following series
[v2] mm/virtio: skip redundant zeroing of host-zeroed reported pages
https://lore.kernel.org/all/cover.1776689093.git.mst@redhat.com
* [PATCH RFC v2 01/18] mm: page_alloc: propagate PageReported flag across buddy splits
* [PATCH RFC v2 02/18] mm: add pghint_t type and vma_alloc_folio_hints API
* [PATCH RFC v2 03/18] mm: add PG_zeroed page flag for known-zero pages
* [PATCH RFC v2 04/18] mm: page_alloc: track PG_zeroed across buddy merges
* [PATCH RFC v2 05/18] mm: page_alloc: preserve PG_zeroed in try_to_claim_block
* [PATCH RFC v2 06/18] mm: page_alloc: thread pghint_t through get_page_from_freelist
* [PATCH RFC v2 07/18] mm: post_alloc_hook: use PG_zeroed to skip zeroing, return pghint_t
* [PATCH RFC v2 08/18] mm: hugetlb: thread pghint_t through buddy allocation chain
* [PATCH RFC v2 09/18] mm: hugetlb: use PG_zeroed for pool pages, skip redundant zeroing
* [PATCH RFC v2 10/18] mm: page_reporting: support host-zeroed reported pages
* [PATCH RFC v2 11/18] mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed pages
* [PATCH RFC v2 12/18] mm: skip zeroing in alloc_anon_folio for pre-zeroed pages
* [PATCH RFC v2 13/18] mm: skip zeroing in vma_alloc_anon_folio_pmd for pre-zeroed pages
* [PATCH RFC v2 14/18] mm: memfd: skip zeroing for pre-zeroed hugetlb pages
* [PATCH RFC v2 15/18] virtio_balloon: add host_zeroes_pages module parameter
* [PATCH RFC v2 16/18] mm: page_reporting: add flush parameter with page budget
* [PATCH RFC v2 17/18] mm: add free_frozen_pages_hint and put_page_hint APIs
* [PATCH RFC v2 18/18] virtio_balloon: mark deflated pages as pre-zeroed
and found the following issue:
kernel BUG in free_huge_folio
Full report is available here:
https://ci.syzbot.org/series/329d9cff-a0ad-46d2-8ff4-d9f4341a611f
***
kernel BUG in free_huge_folio
tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: b8a5774cd49996e8ef83b1637a9b547158f18de9
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/60e99a0e-08bf-474f-b034-a8bfd2eb90b0/config
syz repro: https://ci.syzbot.org/findings/2868ce13-1752-4f9f-9aa9-c5ce89f01fc7/syz_repro
ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
page_owner free stack trace missing
------------[ cut here ]------------
kernel BUG at ./include/linux/page-flags.h:698!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6015 Comm: syz.2.19 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__ClearPageZeroed include/linux/page-flags.h:698 [inline]
RIP: 0010:free_huge_folio+0xf93/0x12e0 mm/hugetlb.c:1749
Code: c7 c6 a0 64 db 8b e8 5c 9b fe fe 90 0f 0b e8 74 40 9c ff eb 05 e8 6d 40 9c ff 48 89 df 48 c7 c6 e0 63 db 8b e8 3e 9b fe fe 90 <0f> 0b e8 56 40 9c ff 48 89 df 48 c7 c6 40 64 db 8b e8 27 9b fe fe
RSP: 0018:ffffc90003a675b8 EFLAGS: 00010246
RAX: c71fb9abd148e700 RBX: ffffea0005808000 RCX: 0000000000000000
RDX: 0000000000000007 RSI: ffffffff8defcd3f RDI: 00000000ffffffff
RBP: 1ffffd4000b0101a R08: ffffffff9011ddb7 R09: 1ffffffff2023bb6
R10: dffffc0000000000 R11: fffffbfff2023bb7 R12: ffffea0005808008
R13: ffffea00058080d0 R14: ffffffff9a2e27c0 R15: 0000000000000040
FS: 00007fe11abed6c0(0000) GS:ffff8882a9453000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe119de9f00 CR3: 00000001ba914000 CR4: 00000000000006f0
Call Trace:
<TASK>
__folio_put+0xfc/0x4f0 mm/swap.c:105
hugetlb_mfill_atomic_pte+0x130a/0x1730 mm/hugetlb.c:6294
mfill_atomic_hugetlb mm/userfaultfd.c:601 [inline]
mfill_atomic mm/userfaultfd.c:773 [inline]
mfill_atomic_copy+0xe28/0x1420 mm/userfaultfd.c:872
userfaultfd_copy fs/userfaultfd.c:1642 [inline]
userfaultfd_ioctl+0x2c17/0x5130 fs/userfaultfd.c:2059
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe119d9c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe11abed028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fe11a015fa0 RCX: 00007fe119d9c819
RDX: 00002000000000c0 RSI: 00000000c028aa03 RDI: 0000000000000003
RBP: 00007fe119e32c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fe11a016038 R14: 00007fe11a015fa0 R15: 00007ffe7cd6ce28
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__ClearPageZeroed include/linux/page-flags.h:698 [inline]
RIP: 0010:free_huge_folio+0xf93/0x12e0 mm/hugetlb.c:1749
Code: c7 c6 a0 64 db 8b e8 5c 9b fe fe 90 0f 0b e8 74 40 9c ff eb 05 e8 6d 40 9c ff 48 89 df 48 c7 c6 e0 63 db 8b e8 3e 9b fe fe 90 <0f> 0b e8 56 40 9c ff 48 89 df 48 c7 c6 40 64 db 8b e8 27 9b fe fe
RSP: 0018:ffffc90003a675b8 EFLAGS: 00010246
RAX: c71fb9abd148e700 RBX: ffffea0005808000 RCX: 0000000000000000
RDX: 0000000000000007 RSI: ffffffff8defcd3f RDI: 00000000ffffffff
RBP: 1ffffd4000b0101a R08: ffffffff9011ddb7 R09: 1ffffffff2023bb6
R10: dffffc0000000000 R11: fffffbfff2023bb7 R12: ffffea0005808008
R13: ffffea00058080d0 R14: ffffffff9a2e27c0 R15: 0000000000000040
FS: 00007fe11abed6c0(0000) GS:ffff8882a9453000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe119de9f00 CR3: 00000001ba914000 CR4: 00000000000006f0
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.