[v2] mm/virtio: skip redundant zeroing of host-zeroed reported pages

[PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Michael S. Tsirkin 1 month, 3 weeks ago


v2 - this is an attempt to address David Hildenbrand's comments:
overloading GFP and using page->private, support for
balloon deflate.

I hope this one is acceptable, API wise.

I also went ahead and implemented an alternative approach
that David suggested:
using GFP_ZERO to zero userspace pages.
The issue is simple: on some architectures, one has to know the
userspace fault address in order to flush the cache.

So, I had to propagate the fault address everywhere.
A lot of churn, and my concern is, if we miss even one
place, silent, subtle data corruption will result and only
on some arches (x86 will be fine).

Still, you can view that approach here:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero

David, if you still feel I should switch to that approach,
let me know. Personally, I'd rather keep that as a separate
project from this optimization.


Still an RFC as virtio bits need work, but I would very much like
to get a general agreement on mm bits first. Thanks!

Patch 1 is a minor optimization that I am carrying here to avoid
conflicts. It might make sense to merge it straight away.

-------



When a guest reports free pages to the hypervisor via virtio-balloon's
free page reporting, the host typically zeros those pages when reclaiming
their backing memory (e.g., via MADV_DONTNEED on anonymous mappings).
When the guest later reallocates those pages, the kernel zeros them
again -- redundantly.

This series eliminates that double-zeroing by propagating the "host
already zeroed this page" information through the buddy allocator and
into the page fault path.
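
As a rough userspace illustration of that idea (not kernel code, and not
taken from the patches; `toy_page`, `toy_report_to_host()` and `toy_alloc()`
are hypothetical names), the sketch below tags a free page as known-zero
when it is reported, and consumes that tag exactly once at allocation time
so the second clear is skipped:

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page plus its backing memory. */
struct toy_page {
	bool zeroed;                        /* models a "known zero" flag */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Models reporting a free page; the host drops and zeroes the backing. */
void toy_report_to_host(struct toy_page *page, bool host_zeroes_pages)
{
	if (host_zeroes_pages) {
		memset(page->data, 0, sizeof(page->data)); /* done by the host */
		page->zeroed = true;
	}
}

/*
 * Models a single consume/clear point at allocation time.
 * Returns true if the guest had to clear the page itself.
 */
bool toy_alloc(struct toy_page *page, bool want_zero)
{
	bool was_zeroed = page->zeroed;

	page->zeroed = false;   /* the flag does not outlive the free list */
	if (want_zero && !was_zeroed) {
		memset(page->data, 0, sizeof(page->data));
		return true;
	}
	return false;
}
```

In this model a page that was never reported is cleared as before, so a
missing flag costs only time, never correctness.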

Performance with THP enabled on a 2GB VM, 1 vCPU, allocating
256MB of anonymous pages:

  metric         baseline        optimized       delta
  task-clock     191 +- 31 ms    60 +- 35 ms     -68%
  cache-misses   1.10M +- 460K   269K +- 31K     -76%
  instructions   4.54M +- 275K   4.10M +- 130K   -10%

With hugetlb surplus pages:

  metric         baseline        optimized       delta
  task-clock     183 +- 24 ms    45 +- 23 ms     -76%
  cache-misses   1.27M +- 544K   270K +- 16K     -79%
  instructions   5.37M +- 254K   4.94M +- 155K   -8%

Notes:
- The virtio_balloon module parameter (15/18) is a testing hack.
  A proper virtio feature flag is needed before merging.
- Patch 16/18 adds a sysfs flush trigger for deterministic testing
  (avoids waiting for the 2-second reporting delay).
- When host_zeroes_pages is set, callers skip folio_zero_user() for
  pages known to be zeroed by the host. This is safe on all
  architectures because the hypervisor invalidates guest cache lines
  when reclaiming page backing (MADV_DONTNEED).
- PG_zeroed is aliased to PG_private. It is excluded from
  PAGE_FLAGS_CHECK_AT_PREP because it must survive on free-list pages
  until post_alloc_hook() consumes and clears it. Is this acceptable,
  or should a different bit be used?
- The optimization is most effective with THP, where entire 2MB
  pages are allocated directly from reported order-9+ buddy pages.
  Without THP, only ~21% of order-0 allocations come from reported
  pages due to low-order fragmentation.
- Persistent hugetlb pool pages are not covered: when freed by
  userspace they return to the hugetlb free pool, not the buddy
  allocator, so they are never reported to the host.  Surplus
  hugetlb pages are allocated from buddy and do benefit.

Test program:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  #ifndef MADV_POPULATE_WRITE
  #define MADV_POPULATE_WRITE 23
  #endif
  #ifndef MAP_HUGETLB
  #define MAP_HUGETLB 0x40000
  #endif

  int main(int argc, char **argv)
  {
      unsigned long size;
      int flags = MAP_PRIVATE | MAP_ANONYMOUS;
      void *p;
      int r;

      if (argc < 2) {
          fprintf(stderr, "usage: %s <size_mb> [huge]\n", argv[0]);
          return 1;
      }
      size = atol(argv[1]) * 1024UL * 1024;
      if (argc >= 3 && strcmp(argv[2], "huge") == 0)
          flags |= MAP_HUGETLB;
      p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
      if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      r = madvise(p, size, MADV_POPULATE_WRITE);
      if (r) {
          perror("madvise");
          return 1;
      }
      munmap(p, size);
      return 0;
  }

Test script (bench.sh):

  #!/bin/bash
  # Usage: bench.sh <size_mb> <mode> <iterations> [huge]
  # mode 0 = baseline, mode 1 = skip zeroing
  SZ=${1:-256}; MODE=${2:-0}; ITER=${3:-10}; HUGE=${4:-}
  FLUSH=/sys/module/page_reporting/parameters/flush
  PERF_DATA=/tmp/perf-$MODE.csv
  rmmod virtio_balloon 2>/dev/null
  insmod virtio_balloon.ko host_zeroes_pages=$MODE
  echo 512 > $FLUSH
  [ "$HUGE" = "huge" ] && echo $((SZ/2)) > /proc/sys/vm/nr_overcommit_hugepages
  rm -f $PERF_DATA
  echo "=== sz=${SZ}MB mode=$MODE iter=$ITER $HUGE ==="
  for i in $(seq 1 $ITER); do
      echo 3 > /proc/sys/vm/drop_caches
      echo 512 > $FLUSH
      perf stat -e task-clock,instructions,cache-misses \
          -x, -o $PERF_DATA --append -- ./alloc_once $SZ $HUGE
  done
  [ "$HUGE" = "huge" ] && echo 0 > /proc/sys/vm/nr_overcommit_hugepages
  rmmod virtio_balloon
  awk -F, '/^#/||/^$/{next}{v=$1+0;e=$3;gsub(/ /,"",e);s[e]+=v;n[e]++}
  END{for(e in s)printf "  %-16s %10.2f (n=%d)\n",e,s[e]/n[e],n[e]}' $PERF_DATA

Compile and run:
  gcc -static -O2 -o alloc_once alloc_once.c
  bash bench.sh 256 0 10          # baseline (regular pages)
  bash bench.sh 256 1 10          # optimized (regular pages)
  bash bench.sh 256 0 10 huge     # baseline (hugetlb surplus)
  bash bench.sh 256 1 10 huge     # optimized (hugetlb surplus)

Changes since v1:
- Replaced __GFP_PREZEROED with PG_zeroed page flag (aliased PG_private)
- Added pghint_t type and vma_alloc_folio_hints() API
- Track PG_zeroed across buddy merges and splits
- Added post_alloc_hook integration (single consume/clear point)
- Added hugetlb support (pool pages + memfd)
- Added page_reporting flush parameter for deterministic testing
- Added free_frozen_pages_hint/put_page_hint for balloon deflate path
- Added try_to_claim_block PG_zeroed preservation
- Updated perf numbers with per-iteration flush methodology
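
The merge/split tracking can be pictured with a small userspace model. It
assumes (this is a guess at the semantics, not something read out of the
patches) that a merged block may only be treated as zeroed when both
buddies were, and that splitting a zeroed block leaves both halves zeroed.
`toy_block`, `toy_merge()` and `toy_split()` are hypothetical names:

```c
#include <stdbool.h>

/* Hypothetical model of a free buddy block. */
struct toy_block {
	unsigned int order;
	bool zeroed;
};

/* Merge two free buddies of equal order into one block of order + 1. */
struct toy_block toy_merge(struct toy_block a, struct toy_block b)
{
	struct toy_block merged = {
		.order = a.order + 1,
		/* One dirty half makes the whole block dirty. */
		.zeroed = a.zeroed && b.zeroed,
	};

	return merged;
}

/* Split a block of order > 0; both halves inherit the parent's state. */
void toy_split(struct toy_block parent, struct toy_block *lo,
	       struct toy_block *hi)
{
	lo->order = hi->order = parent.order - 1;
	lo->zeroed = hi->zeroed = parent.zeroed;
}
```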

Michael S. Tsirkin (18):
  mm: page_alloc: propagate PageReported flag across buddy splits
  mm: add pghint_t type and vma_alloc_folio_hints API
  mm: add PG_zeroed page flag for known-zero pages
  mm: page_alloc: track PG_zeroed across buddy merges
  mm: page_alloc: preserve PG_zeroed in try_to_claim_block
  mm: page_alloc: thread pghint_t through get_page_from_freelist
  mm: post_alloc_hook: use PG_zeroed to skip zeroing, return pghint_t
  mm: hugetlb: thread pghint_t through buddy allocation chain
  mm: hugetlb: use PG_zeroed for pool pages, skip redundant zeroing
  mm: page_reporting: support host-zeroed reported pages
  mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed
    pages
  mm: skip zeroing in alloc_anon_folio for pre-zeroed pages
  mm: skip zeroing in vma_alloc_anon_folio_pmd for pre-zeroed pages
  mm: memfd: skip zeroing for pre-zeroed hugetlb pages
  virtio_balloon: add host_zeroes_pages module parameter
  mm: page_reporting: add flush parameter with page budget
  mm: add free_frozen_pages_hint and put_page_hint APIs
  virtio_balloon: mark deflated pages as pre-zeroed

 drivers/virtio/virtio_balloon.c |  11 ++-
 fs/hugetlbfs/inode.c            |   5 +-
 include/linux/gfp.h             |  17 +++++
 include/linux/highmem.h         |   6 +-
 include/linux/hugetlb.h         |   6 +-
 include/linux/mm.h              |  12 +++
 include/linux/page-flags.h      |  13 +++-
 include/linux/page_reporting.h  |   3 +
 mm/compaction.c                 |   4 +-
 mm/huge_memory.c                |  12 +--
 mm/hugetlb.c                    |  52 +++++++++----
 mm/internal.h                   |   7 +-
 mm/memfd.c                      |  12 +--
 mm/memory.c                     |  14 ++--
 mm/mempolicy.c                  |  85 +++++++++++++++++++++
 mm/page_alloc.c                 | 131 ++++++++++++++++++++++++--------
 mm/page_reporting.c             |  55 +++++++++++++-
 mm/page_reporting.h             |  11 +++
 mm/swap.c                       |  19 +++++
 19 files changed, 392 insertions(+), 83 deletions(-)

-- 
MST

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Gregory Price 1 month, 3 weeks ago

On Mon, Apr 20, 2026 at 08:51:13AM -0400, Michael S. Tsirkin wrote:
> 
> When a guest reports free pages to the hypervisor via virtio-balloon's
> free page reporting, the host typically zeros those pages when reclaiming
> their backing memory (e.g., via MADV_DONTNEED on anonymous mappings).
> When the guest later reallocates those pages, the kernel zeros them
> again -- redundantly.
>

It took me a second to really wrap my head around what you were saying
here, but if i'm following correctly:

  1) Guest steals a page, reports the free page to the host
  2) Host returns that page to the buddy
  3) Guest wants the page back -> vmexit, alloc()
      a) host gets a page from the buddy via fault path
      b) this memory is "user memory" so host zeroes the page
  4) Guest repeats step 3, re-zeroing the page

So you're adding a step that does:

  1) page_reporting_drain() in guest sets PG_zeroed if host_zeroes_pages=true
  2) on allocation, if PG_zeroed is set, don't zero

In theory this seems ok.  PG_zeroed being a buddy-only flag is nice.

In practice there are obvious concerns about an explicit flag that would
allow a kernel (in this case the guest) to skip zeroing a page destined
for userland mappings - but i'm also paranoid.

In concept this seems reasonable, in implementation I have concerns
about the pghint_t type being added. Will respond inline in David's
reply thread on that though where you already have notes.

~Gregory

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by David Hildenbrand (Arm) 1 month, 3 weeks ago

On 4/20/26 14:51, Michael S. Tsirkin wrote:
> 

Hi!

> 
> v2 - this is an attempt to address David Hildenbrand's comments:
> overloading GFP and using page->private, support for
> balloon deflate.
> 
> I hope this one is acceptable, API wise.
> 
> I also went ahead and implemented an alternative approach
> that David suggested:
> using GFP_ZERO to zero userspace pages.
> The issue is simple: on some architectures, one has to know the
> userspace fault address in order to flush the cache.
> 
> So, I had to propagate the fault address everywhere.

As I said, that might not be necessary. vma_alloc_folio() is the
interface we mostly care about in that regard.

> A lot of churn, and my concern is, if we miss even one
> place, silent, subtle data corruption will result and only
> on some arches (x86 will be fine).

Which would *already* be the case if you use folio_alloc(GFP_ZERO)
instead of magical vma_alloc_folio() + folio_zero_user().

I don't really see how vma_alloc_folio_hints() -- that also consumes the
address -- is any better in that regard?

When we just do the right thing with vma_alloc_folio(GFP_ZERO), at least
vma_alloc_folio() users will not accidentally do the wrong thing by
forgetting to use folio_zero_user().

> 
> Still, you can view that approach here:
> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero
> 
> David, if you still feel I should switch to that approach,
> let me know. Personally, I'd rather keep that as a separate
> project from this optimization.
I'd prefer if we extend vma_alloc_folio() to just handle GFP_ZERO for us.

But let's hear other opinions first.

-- 
Cheers,

David

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Michael S. Tsirkin 1 month, 3 weeks ago

On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> On 4/20/26 14:51, Michael S. Tsirkin wrote:
> > 
> 
> Hi!
> 
> > 
> > v2 - this is an attempt to address David Hildenbrand's comments:
> > overloading GFP and using page->private, support for
> > balloon deflate.
> > 
> > I hope this one is acceptable, API wise.
> > 
> > I also went ahead and implemented an alternative approach
> > that David suggested:
> > using GFP_ZERO to zero userspace pages.
> > The issue is simple: on some architectures, one has to know the
> > userspace fault address in order to flush the cache.
> > 
> > So, I had to propagate the fault address everywhere.
> 
> As I said, that might not be necessary. vma_alloc_folio() is the
> interface we mostly care about in that regard.
>

I'm not sure I follow what "might not be necessary". We need a fault
address so zeroing can be effective wrt cache. Since you asked that it's
done deep in post alloc hook, the address has to propagate all over mm.

> > A lot of churn, and my concern is, if we miss even one
> > place, silent, subtle data corruption will result and only
> > on some arches (x86 will be fine).
> 
> Which would *already* be the case if you use folio_alloc(GFP_ZERO)
> instead of magical vma_alloc_folio() + folio_zero_user().
> 
> I don't really see how vma_alloc_folio_hints() -- that also consumes the
> address -- is any better in that regard?

By itself, it is not. But the issue is propagating the address from
there all over mm. If we miss even one place - we get a subtle cache
corruption on non x86.

hints are exactly that - if we forget to set them, all that happens
is that we do an extra zeroing. That is all.

> When we just do the right thing with vma_alloc_folio(GFP_ZERO), at least
> vma_alloc_folio() users will not accidentally do the wrong thing by
> forgetting to use folio_zero_user().

Well, it's simply that
1. if you plain forget folio_zero_user you get non zero on all arches
2. we *already* have folio_zero_user in place

> > 
> > Still, you can view that approach here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero
> > 
> > David, if you still feel I should switch to that approach,
> > let me know. Personally, I'd rather keep that as a separate
> > project from this optimization.
> I'd prefer if we extend vma_alloc_folio() to just handle GFP_ZERO for us.

Pls take a look at that tree then. What do you think of that approach?
Better? If you want it in form of patches, I can post them
in private or on list.

Let me know, I don't have a problem with that approach - I tested
it and the performance is the same.  But the issue is that there are a lot
of paths that have to propagate the fault address. It took me a while to
even find them all (assuming I found them all).

I also note that we need a flag for free in order to implement
balloon deflate as you asked. Here, I reused the hints.

> But let's hear other opinions first.
> 
> -- 
> Cheers,
> 
> David

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by David Hildenbrand (Arm) 1 month, 3 weeks ago

On 4/21/26 01:33, Michael S. Tsirkin wrote:
> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/20/26 14:51, Michael S. Tsirkin wrote:
>>>
>>
>> Hi!
>>
>>>
>>> v2 - this is an attempt to address David Hildenbrand's comments:
>>> overloading GFP and using page->private, support for
>>> balloon deflate.
>>>
>>> I hope this one is acceptable, API wise.
>>>
>>> I also went ahead and implemented an alternative approach
>>> that David suggested:
>>> using GFP_ZERO to zero userspace pages.
>>> The issue is simple: on some architectures, one has to know the
>>> userspace fault address in order to flush the cache.
>>>
>>> So, I had to propagate the fault address everywhere.
>>
>> As I said, that might not be necessary. vma_alloc_folio() is the
>> interface we mostly care about in that regard.
>>
> 
> I'm not sure I follow what "might not be necessary". We need a fault
> address so zeroing can be effective wrt cache. Since you asked that it's
> done deep in post alloc hook, the address has to propagate all over mm.

Let's look at who ends up using user_alloc_needs_zeroing() or
folio_zero_user()

3 folio_zero_user() users are hugetlb that might get pages from another
allocator. In particular, mm/memfd.c even passes 0 as it doesn't even
have an address.

I don't think we particularly care about speeding up hugetlb zeroing at
this point when we already don't even care about optimizing for
user_alloc_needs_zeroing(). But it could be reworked later to optimize
zeroing in a similar way when actually allocating a folio from the buddy.

Now, for callers we care more about:

* vma_alloc_anon_folio_pmd() calls
  vma_alloc_folio()+user_alloc_needs_zeroing()+folio_zero_user()
* alloc_anon_folio() calls
  vma_alloc_folio()+user_alloc_needs_zeroing()+folio_zero_user()
* vma_alloc_zeroed_movable_folio() calls
  vma_alloc_folio()+user_alloc_needs_zeroing()+
  clear_user_highpage().

Other vma_alloc_folio() users neither specify __GFP_ZERO nor use
folio_zero_user(), as they will be overwriting the data either way. Like
KSM when unsharing, for example.

I am saying we move "user_alloc_needs_zeroing()+folio_zero_user()" into
vma_alloc_folio(), by teaching vma_alloc_folio() to respect __GFP_ZERO.

user_alloc_needs_zeroing() will effectively go away as the buddy will
just handle that.

All of the above is what you do on gfp_zero branch already, so I think
you understood what I meant regarding this interface.
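
A minimal userspace model of that interface change, under stated
assumptions: `toy_vma_alloc_folio()`, `TOY_GFP_ZERO` and the
`backing_is_zero` parameter are hypothetical stand-ins, and the "cache
maintenance" is only recorded, not performed. The point it tries to show is
that once the allocation call honours the zero flag, callers have no
separate zeroing step to forget:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_GFP_ZERO   0x1u   /* stand-in for __GFP_ZERO */
#define TOY_FOLIO_SIZE 4096

struct toy_folio {
	unsigned long zeroed_for_addr;  /* user address used when clearing */
	unsigned char data[TOY_FOLIO_SIZE];
};

/* Counts how often the allocator had to clear memory itself. */
unsigned int toy_zero_count;

/* Stand-in for folio_zero_user(): clear, then note the user address an
 * architecture with aliasing caches would need for its flush. */
static void toy_zero_user(struct toy_folio *folio, unsigned long addr)
{
	memset(folio->data, 0, sizeof(folio->data));
	folio->zeroed_for_addr = addr;
	toy_zero_count++;
}

/*
 * Models an allocation call that honours the zero flag internally.
 * backing_is_zero models the allocator knowing the memory is already zero.
 */
struct toy_folio *toy_vma_alloc_folio(unsigned int flags, unsigned long addr,
				      bool backing_is_zero)
{
	struct toy_folio *folio;

	if (backing_is_zero) {
		folio = calloc(1, sizeof(*folio));   /* already-zero backing */
	} else {
		folio = malloc(sizeof(*folio));
		if (folio) {
			folio->zeroed_for_addr = 0;
			memset(folio->data, 0xaa, sizeof(folio->data)); /* stale */
		}
	}
	if (!folio)
		return NULL;
	if ((flags & TOY_GFP_ZERO) && !backing_is_zero)
		toy_zero_user(folio, addr);
	return folio;
}
```

A caller that needs a clean folio passes `TOY_GFP_ZERO` and the user
address; a caller that will overwrite the data anyway passes no flag.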

Anybody in the tree that would be using another folio_alloc() (or page
allocator) interface with __GFP_ZERO *would already be broken on other
architectures* where we would actually require folio_zero_user(), as
they would already not be using folio_zero_user().

But we don't really need the user address in many cases, like when
allocating a folio for the pagecache where we don't even have an address.

> 
> 
>>> A lot of churn, and my concern is, if we miss even one
>>> place, silent, subtle data corruption will result and only
>>> on some arches (x86 will be fine).
>>
>> Which would *already* be the case if you use folio_alloc(GFP_ZERO)
>> instead of magical vma_alloc_folio() + folio_zero_user().
>>
>> I don't really see how vma_alloc_folio_hints() -- that also consumes the
>> address -- is any better in that regard?
> 
> By itself, it is not. But the issue is propagating the address from
> there all over mm. If we miss even one place - we get a subtle cache
> corruption on non x86.

Yes. Like someone not using folio_zero_user() as of today.

[...]

> 
>>>
>>> Still, you can view that approach here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git gfp_zero
>>>
>>> David, if you still feel I should switch to that approach,
>>> let me know. Personally, I'd rather keep that as a separate
>>> project from this optimization.
>> I'd prefer if we extend vma_alloc_folio() to just handle GFP_ZERO for us.
> 
> Pls take a look at that tree then. What do you think of that approach?
> Better? 

I primarily wonder whether we can limit the impact in patch #1 by
focusing on the vma_alloc_folio() path only.

For example, I don't think converting all folio_alloc_mpol() users to
consume USER_ADDR_NONE at this point is really reasonable.

(a) Focus on vma_alloc_folio(), where we already have an address.

(b) To implement vma_alloc_folio() that way, we might need some internal
    interfaces that consume an address.

For example, instead of changing all callers of post_alloc_hook() to
pass USER_ADDR_NONE, can we make post_alloc_hook() a simple wrapper
around a variant that consumes an address?

So isn't there a way we can just keep the changes mostly to mm/page_alloc.c?
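
The wrapper idea could look roughly like the following userspace sketch.
All `toy_` names are hypothetical, the value chosen for
`TOY_USER_ADDR_NONE` is an assumption, and the cache flush is only recorded
rather than performed:

```c
#include <string.h>

#define TOY_USER_ADDR_NONE (~0UL)  /* assumed encoding of "no user address" */
#define TOY_GFP_ZERO       0x1u    /* stand-in for __GFP_ZERO */
#define TOY_PAGE_SIZE      4096

struct toy_page {
	unsigned long flushed_for;  /* address the last clear was done for */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Address-aware variant: only the vma-based path calls this directly. */
void toy_post_alloc_hook_addr(struct toy_page *page, unsigned int flags,
			      unsigned long user_addr)
{
	if (!(flags & TOY_GFP_ZERO))
		return;
	memset(page->data, 0, sizeof(page->data));
	/* An arch with aliasing caches would flush for user_addr here. */
	page->flushed_for = user_addr;
}

/* Existing entry point keeps its signature, so its callers stay untouched. */
void toy_post_alloc_hook(struct toy_page *page, unsigned int flags)
{
	toy_post_alloc_hook_addr(page, flags, TOY_USER_ADDR_NONE);
}
```

With this shape, only the code that already has a fault address has to
change; everything else keeps calling the two-argument form.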

> 
> I also note that we need a flag for free in order to implement
> balloon deflate as you asked. Here, I reused the hints.

Yes, but on the allocation path we do have a flag: __GFP_ZERO.

-- 
Cheers,

David

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Gregory Price 1 month, 3 weeks ago

On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> > On 4/20/26 14:51, Michael S. Tsirkin wrote:
> 
> > > A lot of churn, and my concern is, if we miss even one
> > > place, silent, subtle data corruption will result and only
> > > on some arches (x86 will be fine).
> > 
> > Which would *already* be the case if you use folio_alloc(GFP_ZERO)
> > instead of magical vma_alloc_folio() + folio_zero_user().
> > 
> > I don't really see how vma_alloc_folio_hints() -- that also consumes the
> > address -- is any better in that regard?
> 
> By itself, it is not. But the issue is propagating the address from
> there all over mm. If we miss even one place - we get a subtle cache
> corruption on non x86.
> 

Why does it need to propagate?

Can we leave folio_zero_user() callers the same, but add a PG_zeroed
check in folio_zero_user() that skips the zeroing (but not the cache
flush) and clear the PG_zeroed bit?

Is this feasible?

You don't eliminate the folio_zero_user(), but maybe we shouldn't?

(a bit naive here - i haven't checked the PG_zeroed lifetime, i did
 see it overloads PG_private - so this might not be feasible)

> 
> I also note that we need a flag for free in order to implement
> balloon deflate as you asked. Here, I reused the hints.
> 

I'd sooner just implement this as

   ___put_folio(folio, gfp_t)

   __put_folio(folio) { ___put_folio(folio, NULL); }

And change the free path to take overloaded gfp flags.

Some of the existing ones might even be useful as-is.

It's essentially the same thing, but prevents a bunch of churn and
saves us a new concept.

optional gfp flags on free seem like genuinely useful interface for
certain callers (definitely not all).
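
To make the shape of that suggestion concrete, here is a userspace sketch.
It is only a model: `toy_gfp_t`, `TOY_FREE_ZEROED`,
`toy_put_folio_flags()` and `toy_put_folio()` are hypothetical, and the
"already zeroed" flag in particular is invented for illustration. The
wrapper passes 0 rather than NULL because gfp_t is an integer bitmask type:

```c
#include <stdbool.h>

typedef unsigned int toy_gfp_t;

#define TOY_FREE_ZEROED 0x1u   /* hypothetical "contents are zero" flag */

struct toy_folio {
	int refcount;
	bool freed;
	bool zeroed_on_free;   /* what the free path was told about contents */
};

/* Flag-taking variant of the put path. */
void toy_put_folio_flags(struct toy_folio *folio, toy_gfp_t flags)
{
	if (--folio->refcount > 0)
		return;                 /* other references remain */
	folio->freed = true;
	folio->zeroed_on_free = (flags & TOY_FREE_ZEROED) != 0;
}

/* Plain variant: existing callers pass no flags and behave as before. */
void toy_put_folio(struct toy_folio *folio)
{
	toy_put_folio_flags(folio, 0);
}
```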

~Gregory

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Michael S. Tsirkin 1 month, 3 weeks ago

On Mon, Apr 20, 2026 at 10:38:19PM -0400, Gregory Price wrote:
> On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> > > On 4/20/26 14:51, Michael S. Tsirkin wrote:
> > 
> > > > A lot of churn, and my concern is, if we miss even one
> > > > place, silent, subtle data corruption will result and only
> > > > on some arches (x86 will be fine).
> > > 
> > > Which would *already* be the case if you use folio_alloc(GFP_ZERO)
> > > instead of magical vma_alloc_folio() + folio_zero_user().
> > > 
> > > I don't really see how vma_alloc_folio_hints() -- that also consumes the
> > > address -- is any better in that regard?
> > 
> > By itself, it is not. But the issue is propagating the address from
> > there all over mm. If we miss even one place - we get a subtle cache
> > corruption on non x86.
> > 
> 
> Why does it need to propagate?
> 
> Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> check in folio_zero_user() that skips the zeroing (but not the cache
> flush) and clear the PG_zeroed bit?
> 
> Is this feasible?

I do not see how - this would require leaking the page flag out of the
buddy allocator.


> You don't eliminate the folio_zero_user(), but maybe we shouldn't?
> 
> (a bit naive here - i haven't checked the PG_zeroed lifetime, i did
>  see it overloads PG_private - so this might not be feasible)
> 
> > 
> > I also note that we need a flag for free in order to implement
> > balloon deflate as you asked. Here, I reused the hints.
> > 
> 
> I'd sooner just implement this as
> 
>    ___put_folio(folio, gfp_t)
> 
>    __put_folio(folio) { ___put_folio(folio, NULL); }
> 
> And change the free path to take overloaded gfp flags.
> Some of the existing ones might even be useful as-is.
> 
> It's essentially the same thing, but prevents a bunch of churn and
> saves us a new concept.
> 
> optional gfp flags on free seem like genuinely useful interface for
> certain callers (definitely not all).
> 
> ~Gregory

But we do not have a gfp_t flag meaning "this has been zeroed"
and when I proposed something similar in v1, David hated abusing
gfp flags for what is not an allocation property.

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Gregory Price 1 month, 3 weeks ago

On Tue, Apr 21, 2026 at 09:06:00AM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 20, 2026 at 10:38:19PM -0400, Gregory Price wrote:
> > 
> > Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> > check in folio_zero_user() that skips the zeroing (but not the cache
> > flush) and clear the PG_zeroed bit?
> > 
> > Is this feasible?
> 
> I do not see how - this would require leaking the page flag out of the
> buddy allocator.
>

Right, but you're leaking that bit of information out one way or another
- whether it's a page-flag or something else (pghint_t) you have the
same lifecycle problems (when does it become invalidated? how long can
it be trusted for?).

I suppose at least with (pghint_t) the data (in theory) falls out of
scope and doesn't live with the page - but guaranteed it just ends up
polluting more and more interfaces.

I'm seeing why David's suggestion to plumb __GFP_ZERO correctly makes
sense, it's really the only feasible approach here that doesn't generate
a staleness problem with whatever information you try to leak out.

~Gregory

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Michael S. Tsirkin 1 month, 3 weeks ago

On Tue, Apr 21, 2026 at 12:51:00PM -0400, Gregory Price wrote:
> On Tue, Apr 21, 2026 at 09:06:00AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Apr 20, 2026 at 10:38:19PM -0400, Gregory Price wrote:
> > > 
> > > Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> > > check in folio_zero_user() that skips the zeroing (but not the cache
> > > flush) and clear the PG_zeroed bit?
> > > 
> > > Is this feasible?
> > 
> > I do not see how - this would require leaking the page flag out of the
> > buddy allocator.
> >
> 
> Right, but you're leaking that bit of information out one way or another
> - whether it's a page-flag or something else (pghint_t) you have the
> same lifecycle problems (when does it become invalidated? how long can
> it be trusted for?).
>
> I suppose at least with (pghint_t) the data (in theory) falls out of
> scope and doesn't live with the page - but guaranteed it just ends up
> polluting more and more interfaces.
> 
> I'm seeing why David's suggestion to plumb __GFP_ZERO correctly makes
> sense, it's really the only feasible approach here that doesn't generate
> a staleness problem with whatever information you try to leak out.
> 
> ~Gregory

OK, v3 with that incoming.

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by David Hildenbrand (Arm) 1 month, 3 weeks ago

On 4/21/26 04:38, Gregory Price wrote:
> On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
>> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
>>
>>>
>>> Which would *already* be the case if you use folio_alloc(GFP_ZERO)
>>> instead of magical vma_alloc_folio() + folio_zero_user().
>>>
>>> I don't really see how vma_alloc_folio_hints() -- that also consumes the
>>> address -- is any better in that regard?
>>
>> By itself, it is not. But the issue is propagating the address from
>> there all over mm. If we miss even one place - we get a subtle cache
>> corruption on non x86.
>>
> 
> Why does it need to propagate?
> 
> Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> check in folio_zero_user() that skips the zeroing (but not the cache
> flush) and clear the PG_zeroed bit?

folio_zero_user() is just an abomination, really.

-- 
Cheers,

David

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by Michael S. Tsirkin 1 month, 3 weeks ago

On Tue, Apr 21, 2026 at 12:04:49PM +0200, David Hildenbrand (Arm) wrote:
> On 4/21/26 04:38, Gregory Price wrote:
> > On Mon, Apr 20, 2026 at 07:33:38PM -0400, Michael S. Tsirkin wrote:
> >> On Mon, Apr 20, 2026 at 08:20:57PM +0200, David Hildenbrand (Arm) wrote:
> >>
> >>>
> >>> Which would *already* be the case if you use folio_alloc(GFP_ZERO)
> >>> instead of magical vma_alloc_folio() + folio_zero_user().
> >>>
> >>> I don't really see how vma_alloc_folio_hints() -- that also consumes the
> >>> address -- is any better in that regard?
> >>
> >> By itself, it is not. But the issue is propagating the address from
> >> there all over mm. If we miss even one place - we get a subtle cache
> >> corruption on non x86.
> >>
> > 
> > Why does it need to propagate?
> > 
> > Can we leave folio_zero_user() callers the same, but add a PG_zeroed
> > check in folio_zero_user() that skips the zeroing (but not the cache
> > flush) and clear the PG_zeroed bit?
> 
> folio_zero_user() is just an abomination, really.

We can't completely replace it with GFP_ZERO though e.g. because hugetlbfs
has its own pool and needs to zero that.

> -- 
> Cheers,
> 
> David

Re: [PATCH RFC v2 00/18] mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by David Hildenbrand (Arm) 1 month, 3 weeks ago

On 4/21/26 16:15, Michael S. Tsirkin wrote:
> On Tue, Apr 21, 2026 at 12:04:49PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/21/26 04:38, Gregory Price wrote:
>>>
>>> Why does it need to propagate?
>>>
>>> Can we leave folio_zero_user() callers the same, but add a PG_zeroed
>>> check in folio_zero_user() that skips the zeroing (but not the cache
>>> flush) and clear the PG_zeroed bit?
>>
>> folio_zero_user() is just an abomination, really.
> 
> We can't completely replace it with GFP_ZERO though e.g. because hugetlbfs
> has its own pool and needs to zero that.

Right, hugetlb will have to keep using it for now.

-- 
Cheers,

David

[syzbot ci] Re: mm/virtio: skip redundant zeroing of host-zeroed reported pages

Posted by syzbot ci 1 month, 3 weeks ago

syzbot ci has tested the following series

[v2] mm/virtio: skip redundant zeroing of host-zeroed reported pages
https://lore.kernel.org/all/cover.1776689093.git.mst@redhat.com
* [PATCH RFC v2 01/18] mm: page_alloc: propagate PageReported flag across buddy splits
* [PATCH RFC v2 02/18] mm: add pghint_t type and vma_alloc_folio_hints API
* [PATCH RFC v2 03/18] mm: add PG_zeroed page flag for known-zero pages
* [PATCH RFC v2 04/18] mm: page_alloc: track PG_zeroed across buddy merges
* [PATCH RFC v2 05/18] mm: page_alloc: preserve PG_zeroed in try_to_claim_block
* [PATCH RFC v2 06/18] mm: page_alloc: thread pghint_t through get_page_from_freelist
* [PATCH RFC v2 07/18] mm: post_alloc_hook: use PG_zeroed to skip zeroing, return pghint_t
* [PATCH RFC v2 08/18] mm: hugetlb: thread pghint_t through buddy allocation chain
* [PATCH RFC v2 09/18] mm: hugetlb: use PG_zeroed for pool pages, skip redundant zeroing
* [PATCH RFC v2 10/18] mm: page_reporting: support host-zeroed reported pages
* [PATCH RFC v2 11/18] mm: skip zeroing in vma_alloc_zeroed_movable_folio for pre-zeroed pages
* [PATCH RFC v2 12/18] mm: skip zeroing in alloc_anon_folio for pre-zeroed pages
* [PATCH RFC v2 13/18] mm: skip zeroing in vma_alloc_anon_folio_pmd for pre-zeroed pages
* [PATCH RFC v2 14/18] mm: memfd: skip zeroing for pre-zeroed hugetlb pages
* [PATCH RFC v2 15/18] virtio_balloon: add host_zeroes_pages module parameter
* [PATCH RFC v2 16/18] mm: page_reporting: add flush parameter with page budget
* [PATCH RFC v2 17/18] mm: add free_frozen_pages_hint and put_page_hint APIs
* [PATCH RFC v2 18/18] virtio_balloon: mark deflated pages as pre-zeroed

and found the following issue:
kernel BUG in free_huge_folio

Full report is available here:
https://ci.syzbot.org/series/329d9cff-a0ad-46d2-8ff4-d9f4341a611f

***

kernel BUG in free_huge_folio

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      b8a5774cd49996e8ef83b1637a9b547158f18de9
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/60e99a0e-08bf-474f-b034-a8bfd2eb90b0/config
syz repro: https://ci.syzbot.org/findings/2868ce13-1752-4f9f-9aa9-c5ce89f01fc7/syz_repro

 ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
page_owner free stack trace missing
------------[ cut here ]------------
kernel BUG at ./include/linux/page-flags.h:698!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6015 Comm: syz.2.19 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__ClearPageZeroed include/linux/page-flags.h:698 [inline]
RIP: 0010:free_huge_folio+0xf93/0x12e0 mm/hugetlb.c:1749
Code: c7 c6 a0 64 db 8b e8 5c 9b fe fe 90 0f 0b e8 74 40 9c ff eb 05 e8 6d 40 9c ff 48 89 df 48 c7 c6 e0 63 db 8b e8 3e 9b fe fe 90 <0f> 0b e8 56 40 9c ff 48 89 df 48 c7 c6 40 64 db 8b e8 27 9b fe fe
RSP: 0018:ffffc90003a675b8 EFLAGS: 00010246
RAX: c71fb9abd148e700 RBX: ffffea0005808000 RCX: 0000000000000000
RDX: 0000000000000007 RSI: ffffffff8defcd3f RDI: 00000000ffffffff
RBP: 1ffffd4000b0101a R08: ffffffff9011ddb7 R09: 1ffffffff2023bb6
R10: dffffc0000000000 R11: fffffbfff2023bb7 R12: ffffea0005808008
R13: ffffea00058080d0 R14: ffffffff9a2e27c0 R15: 0000000000000040
FS:  00007fe11abed6c0(0000) GS:ffff8882a9453000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe119de9f00 CR3: 00000001ba914000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __folio_put+0xfc/0x4f0 mm/swap.c:105
 hugetlb_mfill_atomic_pte+0x130a/0x1730 mm/hugetlb.c:6294
 mfill_atomic_hugetlb mm/userfaultfd.c:601 [inline]
 mfill_atomic mm/userfaultfd.c:773 [inline]
 mfill_atomic_copy+0xe28/0x1420 mm/userfaultfd.c:872
 userfaultfd_copy fs/userfaultfd.c:1642 [inline]
 userfaultfd_ioctl+0x2c17/0x5130 fs/userfaultfd.c:2059
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe119d9c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe11abed028 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fe11a015fa0 RCX: 00007fe119d9c819
RDX: 00002000000000c0 RSI: 00000000c028aa03 RDI: 0000000000000003
RBP: 00007fe119e32c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fe11a016038 R14: 00007fe11a015fa0 R15: 00007ffe7cd6ce28
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__ClearPageZeroed include/linux/page-flags.h:698 [inline]
RIP: 0010:free_huge_folio+0xf93/0x12e0 mm/hugetlb.c:1749
Code: c7 c6 a0 64 db 8b e8 5c 9b fe fe 90 0f 0b e8 74 40 9c ff eb 05 e8 6d 40 9c ff 48 89 df 48 c7 c6 e0 63 db 8b e8 3e 9b fe fe 90 <0f> 0b e8 56 40 9c ff 48 89 df 48 c7 c6 40 64 db 8b e8 27 9b fe fe
RSP: 0018:ffffc90003a675b8 EFLAGS: 00010246
RAX: c71fb9abd148e700 RBX: ffffea0005808000 RCX: 0000000000000000
RDX: 0000000000000007 RSI: ffffffff8defcd3f RDI: 00000000ffffffff
RBP: 1ffffd4000b0101a R08: ffffffff9011ddb7 R09: 1ffffffff2023bb6
R10: dffffc0000000000 R11: fffffbfff2023bb7 R12: ffffea0005808008
R13: ffffea00058080d0 R14: ffffffff9a2e27c0 R15: 0000000000000040
FS:  00007fe11abed6c0(0000) GS:ffff8882a9453000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe119de9f00 CR3: 00000001ba914000 CR4: 00000000000006f0


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.