[PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 1 month ago
This patchset is based on an earlier proposal[1] ("mm/hugetlb:
optionally pre-zero hugetlb pages").

Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 250
milliseconds for a 1G page on a Skylake machine).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial
delay when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application starts (such as a VM backed
by these pages), rendering the launch noticeably slow.

On a Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
pages takes about 16 seconds, roughly 250 ms per page. Even with
Ankur's optimizations[2], the time only drops to ~13 seconds
(~200 ms per page), still a noticeable delay.

To accelerate the above scenario, this patchset exports a per-node,
read-write "zeroable_hugepages" sysfs interface for every hugepage size.
This interface reports how many hugepages on that node can currently
be pre-zeroed and allows user space to request that any integer number
in the range [0, max] be zeroed in a single operation.
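
For example (a hypothetical session, assuming 64 huge pages on node 0
are currently zeroable; the exact output format is defined by the
patches):

  # cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
  64
  # echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
  # echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages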

This mechanism offers the following advantages:

(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.

(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.

(3) By binding the process to specific CPUs, users can confine zeroing
threads to cores that do not run latency-critical tasks, eliminating
interference.

(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.

(5) The CPU consumption incurred by zeroing can be throttled and contained
with cgroups, ensuring that the cost is not borne system-wide.

Tested on the same Skylake platform as above: when the 64 GiB of memory
is pre-zeroed in advance by this mechanism, the fault-in latency test
completes in negligible time.

In user space, we can use system calls such as epoll and write to zero
huge folios as they become available, sleeping when none are ready. The
following pseudocode illustrates this approach: it spawns eight threads
(each running thread_fun()) that wait for huge pages on node 0 to become
eligible for zeroing and, whenever such pages are available, clear them
in parallel.

  static void thread_fun(void)
  {
  	epoll_create();
  	epoll_ctl();
  	while (1) {
  		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		if (val > 0)
  			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		epoll_wait();
  	}
  }
  
  static void start_pre_zero_thread(int thread_num)
  {
  	create_pre_zero_threads(thread_num, thread_fun)
  }
  
  int main(void)
  {
  	start_pre_zero_thread(8);
  }
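
A more complete (untested) sketch of the same idea in C is shown below.
It assumes the attribute signals readiness via EPOLLPRI | EPOLLERR, as
is usual for pollable sysfs attributes (patch 7 adds the epoll support),
and that writing "max" zeroes every currently eligible page, as
described above; error handling is omitted.

  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <sys/epoll.h>
  #include <unistd.h>
  
  #define ZEROABLE \
  	"/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"
  
  static void *zero_thread(void *arg)
  {
  	struct epoll_event ev = { .events = EPOLLPRI | EPOLLERR };
  	int fd = open(ZEROABLE, O_RDWR);
  	int epfd = epoll_create1(0);
  	char buf[32];
  	ssize_t n;
  
  	(void)arg;
  	epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
  	for (;;) {
  		/* Re-read from offset 0; this also rearms the sysfs notification. */
  		lseek(fd, 0, SEEK_SET);
  		n = read(fd, buf, sizeof(buf) - 1);
  		if (n > 0) {
  			buf[n] = '\0';
  			if (atol(buf) > 0) {
  				lseek(fd, 0, SEEK_SET);
  				write(fd, "max", 3);	/* zero all currently eligible pages */
  			}
  		}
  		/* Sleep until more huge pages become zeroable on this node. */
  		epoll_wait(epfd, &ev, 1, -1);
  	}
  	return NULL;
  }
  
  int main(void)
  {
  	pthread_t tid[8];
  	int i;
  
  	for (i = 0; i < 8; i++)
  		pthread_create(&tid[i], NULL, zero_thread, NULL);
  	for (i = 0; i < 8; i++)
  		pthread_join(tid[i], NULL);
  	return 0;
  }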

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
[2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u

Li Zhe (8):
  mm/hugetlb: add pre-zeroed framework
  mm/hugetlb: convert to prep_account_new_hugetlb_folio()
  mm/hugetlb: move the huge folio to the end of the list during enqueue
  mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
  mm/hugetlb: relocate the per-hstate struct kobject pointer
  mm/hugetlb: add epoll support for interface "zeroable_hugepages"
  mm/hugetlb: limit event generation frequency of function
    do_zero_free_notify()

 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  26 +++++
 mm/hugetlb.c            | 131 ++++++++++++++++++++++---
 mm/hugetlb_internal.h   |   6 ++
 mm/hugetlb_sysfs.c      | 206 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 337 insertions(+), 35 deletions(-)

---
Changelogs:

v1->v2:
- Use guard() to simplify function hpage_wait_zeroing(). (pointed out
  by Raghu)
- Simplify the logic of zero_free_hugepages_nid() by removing
  redundant checks and exiting the loop upon encountering a
  pre-zeroed folio. (pointed out by Frank)
- Include in the cover letter a performance comparison with Ankur's
  optimization patch[2]. (pointed out by Andrew)

v1: https://lore.kernel.org/all/20251225082059.1632-1-lizhe.67@bytedance.com/

-- 
2.20.1
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Muchun Song 4 weeks, 1 day ago

> On Jan 7, 2026, at 19:31, Li Zhe <lizhe.67@bytedance.com> wrote:
> 
> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").

I’d like you to add a brief summary here that roughly explains
what concerns the previous attempts raised and whether the
current proposal has already addressed those concerns, so more
people can quickly grasp the context.

> 
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 250
> milliseconds for a 1G page on a Skylake machine).
> 
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
> 
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application starts (such as a VM backed
> by these pages), rendering the launch noticeably slow.
> 
> On an Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> pages takes about 16 seconds, roughly 250 ms per page. Even with
> Ankur’s optimizations[2], the time drops only to ~13 seconds,
> ~200 ms per page, still a noticeable delay.

I did see some comments in [1] about QEMU supporting user-mode
parallel zero-page operations; I’m just not sure what the current
state of that support looks like, or what the corresponding benchmark
numbers are.

> 
> To accelerate the above scenario, this patchset exports a per-node,
> read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
> 
> This mechanism offers the following advantages:
> 
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
> 
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
> 
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
> 
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
> 
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
> 
> Tested on the same Skylake platform as above, when the 64 GiB of memory
> was pre-zeroed in advance by the pre-zeroing mechanism, the faulting
> latency test completed in negligible time.
> 
> In user space, we can use system calls such as epoll and write to zero
> huge folios as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads (each running thread_fun()) that wait for huge pages on
> node 0 to become eligible for zeroing; whenever such pages are available,
> the threads clear them in parallel.
> 
>  static void thread_fun(void)
>  {
>   	epoll_create();
>   	epoll_ctl();
>   	while (1) {
>   		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		if (val > 0)
>   			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		epoll_wait();
>   	}
>  }
> 
>  static void start_pre_zero_thread(int thread_num)
>  {
>   	create_pre_zero_threads(thread_num, thread_fun)
>  }
> 
>  int main(void)
>  {
>   	start_pre_zero_thread(8);
>  }
> 
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
> [2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u
> 
> Li Zhe (8):
>  mm/hugetlb: add pre-zeroed framework
>  mm/hugetlb: convert to prep_account_new_hugetlb_folio()
>  mm/hugetlb: move the huge folio to the end of the list during enqueue
>  mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
>  mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
>  mm/hugetlb: relocate the per-hstate struct kobject pointer
>  mm/hugetlb: add epoll support for interface "zeroable_hugepages"
>  mm/hugetlb: limit event generation frequency of function
>    do_zero_free_notify()
> 
> fs/hugetlbfs/inode.c    |   3 +-
> include/linux/hugetlb.h |  26 +++++
> mm/hugetlb.c            | 131 ++++++++++++++++++++++---
> mm/hugetlb_internal.h   |   6 ++
> mm/hugetlb_sysfs.c      | 206 ++++++++++++++++++++++++++++++++++++----
> 5 files changed, 337 insertions(+), 35 deletions(-)
> 
> ---
> Changelogs:
> 
> v1->v2 :
> - Use guard() to simplify function hpage_wait_zeroing(). (pointed by
>  Raghu)
> - Simplify the logic of zero_free_hugepages_nid() by removing
>  redundant checks and exiting the loop upon encountering a
>  pre-zeroed folio. (pointed by Frank)
> - Include in the cover letter a performance comparison with Ankur's
>  optimization patch[2]. (pointed by Andrew)
> 
> v1: https://lore.kernel.org/all/20251225082059.1632-1-lizhe.67@bytedance.com/
> 
> -- 
> 2.20.1
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 4 days ago
On Fri, 9 Jan 2026 14:05:01 +0800, muchun.song@linux.dev wrote:

> > On Jan 7, 2026, at 19:31, Li Zhe <lizhe.67@bytedance.com> wrote:
> > 
> > This patchset is based on this commit[1]("mm/hugetlb: optionally
> > pre-zero hugetlb pages").
> 
> I’d like you to add a brief summary here that roughly explains
> what concerns the previous attempts raised and whether the
> current proposal has already addressed those concerns, so more
> people can quickly grasp the context.

In my opinion, the main concerns raised in the preceding discussion[1]
may be summarized as follows:

(1) The CPU cost of background zeroing is not attributable to the
task that consumes the pages, breaking fairness and cgroup accounting.

(2) Policy (when to zero, how many threads) is hard-coded in the kernel.
User space lacks adequate means of control.

(3) Comparable functionality is already available in user space (QEMU
supports parallel preallocation).

(4) A faster zeroing method is provided in the kernel[2].

In my view, these concerns have already been addressed by this patchset.

It merely supplies the tools and leaves all policy decisions to user
space; the kernel just performs the zeroing on behalf of the user,
thereby resolving concerns (1) and (2).

Regarding concern (3), I am aware that QEMU has implemented a parallel
page-touch mechanism, which does reduce VM creation time; nevertheless,
in our measurements it still consumes a non-trivial amount of time.
(According to feedback from QEMU colleagues, bringing up a 2 TB VM
still requires more than 40 seconds for zeroing)

> > Fresh hugetlb pages are zeroed out when they are faulted in,
> > just like with all other page types. This can take up a good
> > amount of time for larger page sizes (e.g. around 250
> > milliseconds for a 1G page on a Skylake machine).
> > 
> > This normally isn't a problem, since hugetlb pages are typically
> > mapped by the application for a long time, and the initial
> > delay when touching them isn't much of an issue.
> > 
> > However, there are some use cases where a large number of hugetlb
> > pages are touched when an application starts (such as a VM backed
> > by these pages), rendering the launch noticeably slow.
> > 
> > On an Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> > pages takes about 16 seconds, roughly 250 ms per page. Even with
> > Ankur’s optimizations[2], the time drops only to ~13 seconds,
> > ~200 ms per page, still a noticeable delay.

As for concern (4), I believe it is orthogonal to this patchset, and
the cover letter already contains a performance comparison that
demonstrates the additional benefit.

> I did see some comments in [1] about QEMU supporting user-mode
> parallel zero-page operations; I’m just not sure what the current
> state of that support looks like, or what the corresponding benchmark
> numbers are.

As noted above, QEMU already employs a parallel page-touch mechanism,
yet the elapsed time remains noticeable. I am not deeply familiar with
QEMU; please correct me if I am mistaken.

> > To accelerate the above scenario, this patchset exports a per-node,
> > read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> > This interface reports how many hugepages on that node can currently
> > be pre-zeroed and allows user space to request that any integer number
> > in the range [0, max] be zeroed in a single operation.
> > 
> > This mechanism offers the following advantages:
> > 
> > (1) User space gains full control over when zeroing is triggered,
> > enabling it to minimize the impact on both CPU and cache utilization.
> > 
> > (2) Applications can spawn as many zeroing processes as they need,
> > enabling concurrent background zeroing.
> > 
> > (3) By binding the process to specific CPUs, users can confine zeroing
> > threads to cores that do not run latency-critical tasks, eliminating
> > interference.
> > 
> > (4) A zeroing process can be interrupted at any time through standard
> > signal mechanisms, allowing immediate cancellation.
> > 
> > (5) The CPU consumption incurred by zeroing can be throttled and contained
> > with cgroups, ensuring that the cost is not borne system-wide.
> > 
> > Tested on the same Skylake platform as above, when the 64 GiB of memory
> > was pre-zeroed in advance by the pre-zeroing mechanism, the faulting
> > latency test completed in negligible time.

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
[2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Ankur Arora 3 weeks, 4 days ago
Li Zhe <lizhe.67@bytedance.com> writes:

> On Fri, 9 Jan 2026 14:05:01 +0800, muchun.song@linux.dev wrote:
>
>> > On Jan 7, 2026, at 19:31, Li Zhe <lizhe.67@bytedance.com> wrote:
>> >
>> > This patchset is based on this commit[1]("mm/hugetlb: optionally
>> > pre-zero hugetlb pages").
>>
>> I’d like you to add a brief summary here that roughly explains
>> what concerns the previous attempts raised and whether the
>> current proposal has already addressed those concerns, so more
>> people can quickly grasp the context.
>
> In my opinion, the main concerns raised in the preceding discussion[1]
> may be summarized as follows:
>
> (1): The CPU cost of background zeroing is not attributable to the
> task that consumes the pages, breaking fairness and cgroup accounting.
>
> (2) Policy (when, how many threads) is hard-coded in the kernel. User
> space lacks adequate means of control.
>
> (3) Comparable functionality is already available in user space. (QEMU
> support parallel preallocation)
>
> (4) Faster zeroing method is provied in kernel[2].
>
> In my view, these concerns have already been addressed by this patchset.
>
> It merely supplies the tools and leaves all policy decisions to user
> space; the kernel just performs the zeroing on behalf of the user,
> thereby resolving concerns (1) and (2).
>
> Regarding concern (3), I am aware that QEMU has implemented a parallel
> page-touch mechanism, which does reduce VM creation time; nevertheless,
> in our measurements it still consumes a non-trivial amount of time.
> (According to feedback from QEMU colleagues, bringing up a 2 TB VM
> still requires more than 40 seconds for zeroing)
>
>> > Fresh hugetlb pages are zeroed out when they are faulted in,
>> > just like with all other page types. This can take up a good
>> > amount of time for larger page sizes (e.g. around 250
>> > milliseconds for a 1G page on a Skylake machine).
>> >
>> > This normally isn't a problem, since hugetlb pages are typically
>> > mapped by the application for a long time, and the initial
>> > delay when touching them isn't much of an issue.
>> >
>> > However, there are some use cases where a large number of hugetlb
>> > pages are touched when an application starts (such as a VM backed
>> > by these pages), rendering the launch noticeably slow.
>> >
>> > On an Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
>> > pages takes about 16 seconds, roughly 250 ms per page. Even with
>> > Ankur’s optimizations[2], the time drops only to ~13 seconds,
>> > ~200 ms per page, still a noticeable delay.
>
> As for concern (4), I believe it is orthogonal to this patchset, and
> the cover letter already contains a performance comparison that
> demonstrates the additional benefit.

That comparison isn't quite apples to apples though. In the fault
workload above, you are looking at single-threaded zeroing, but
realistically clearing pages at VM init is multi-threaded (QEMU does
that, as David describes).

Also Skylake has probably one of the slowest REP; STOS implementations
I've tried.

--
ankur
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 4 days ago
On Mon, 12 Jan 2026 14:00:23 -0800, ankur.a.arora@oracle.com wrote:

> > Regarding concern (3), I am aware that QEMU has implemented a parallel
> > page-touch mechanism, which does reduce VM creation time; nevertheless,
> > in our measurements it still consumes a non-trivial amount of time.
> > (According to feedback from QEMU colleagues, bringing up a 2 TB VM
> > still requires more than 40 seconds for zeroing)
> >
> >> > Fresh hugetlb pages are zeroed out when they are faulted in,
> >> > just like with all other page types. This can take up a good
> >> > amount of time for larger page sizes (e.g. around 250
> >> > milliseconds for a 1G page on a Skylake machine).
> >> >
> >> > This normally isn't a problem, since hugetlb pages are typically
> >> > mapped by the application for a long time, and the initial
> >> > delay when touching them isn't much of an issue.
> >> >
> >> > However, there are some use cases where a large number of hugetlb
> >> > pages are touched when an application starts (such as a VM backed
> >> > by these pages), rendering the launch noticeably slow.
> >> >
> >> > On an Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> >> > pages takes about 16 seconds, roughly 250 ms per page. Even with
> >> > Ankur's optimizations[2], the time drops only to ~13 seconds,
> >> > ~200 ms per page, still a noticeable delay.
> >
> > As for concern (4), I believe it is orthogonal to this patchset, and
> > the cover letter already contains a performance comparison that
> > demonstrates the additional benefit.
> 
> That comparison isn't quite apples to apples though. In the fault
> workoad above, you are looking at single threaded zeroing but
> realistically clearing pages at VM init is multi-threaded (QEMU does
> that as David describes).
> 
> Also Skylake has probably one of the slowest REP; STOS implementations
> I've tried.

Hi Ankur, thanks for your reply.

The test above merely offers a straightforward comparison of
page-clearing speeds. Its sole purpose is to demonstrate that the
current zeroing phase remains excessively time-consuming.

Even with multi-threaded clearing (QEMU caps the number of concurrent
zeroing threads at 16), booting a 2 TB VM still spends over 40 seconds
on zeroing. Based on the single-threaded test results, it can be
reasonably inferred that even after the clear_page optimization
patches are merged, a substantial amount of time will still be spent
on page zeroing when bringing up a large-scale VM.

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 4 days ago
> As for concern (4), I believe it is orthogonal to this patchset, and
> the cover letter already contains a performance comparison that
> demonstrates the additional benefit.
> 
>> I did see some comments in [1] about QEMU supporting user-mode
>> parallel zero-page operations; I’m just not sure what the current
>> state of that support looks like, or what the corresponding benchmark
>> numbers are.
> 
> As noted above, QEMU already employs a parallel page-touch mechanism,
> yet the elapsed time remains noticeable. I am not deeply familiar with
> QEMU; please correct me if I am mistaken.

I implemented some part of the parallel preallocation support in QEMU.

With QEMU, you can specify the number of threads and even specify the 
NUMA-placement of these threads. So you can pretty much fine-tune that 
for an environment.

You still pre-zero all hugetlb pages at VM startup time, just in 
parallel though. So you pay some price at APP startup time.

If you know that you will run such a VM (or something else) later, you 
could pre-zero the memory from user space by using a hugetlb-backed file 
and supplying that to QEMU as memory backend for the VM. Then, you can 
start your VM without any pre-zeroing.

I guess that approach should work universally. Of course, there are 
limitations, as you would have to know how much memory an app needs, and 
have a way to supply that memory in the form of a file to that app.

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 4 days ago
On Mon, 12 Jan 2026 20:52:12 +0100, david@kernel.org wrote:

> > As for concern (4), I believe it is orthogonal to this patchset, and
> > the cover letter already contains a performance comparison that
> > demonstrates the additional benefit.
> > 
> >> I did see some comments in [1] about QEMU supporting user-mode
> >> parallel zero-page operations; I'm just not sure what the current
> >> state of that support looks like, or what the corresponding benchmark
> >> numbers are.
> > 
> > As noted above, QEMU already employs a parallel page-touch mechanism,
> > yet the elapsed time remains noticeable. I am not deeply familiar with
> > QEMU; please correct me if I am mistaken.
> 
> I implemented some part of the parallel preallocation support in QEMU.
> 
> With QEMU, you can specify the number of threads and even specify the 
> NUMA-placement of these threads. So you can pretty much fine-tune that 
> for an environment.
> 
> You still pre-zero all hugetlb pages at VM startup time, just in 
> parallel though. So you pay some price at APP startup time.

Hi David,

Thank you for the comprehensive explanation.

You are absolutely correct: QEMU's parallel preallocation is performed
only during VM start-up. We submitted this patch series mainly
because we observed that, even with the existing parallel mechanism,
launching large-size VMs still incurs prohibitive delays. (Bringing up
a 2 TB VM still requires more than 40 seconds for zeroing)

> If you know that you will run such a VM (or something else) later, you 
> could pre-zero the memory from user space by using a hugetlb-backed file 
> and supplying that to QEMU as memory backend for the VM. Then, you can 
> start your VM without any pre-zeroing.
> 
> I guess that approach should work universally. Of course, there are 
> limitations, as you would have to know how much memory an app needs, and 
> have a way to supply that memory in form of a file to that app.

Regarding user-space pre-zeroing, I agree that it is feasible once the
VM's memory footprint is known. We evaluated this approach internally;
however, in production environments, it is almost impossible to predict
the exact amount of memory a VM will require.

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 3 days ago
On 1/13/26 07:37, Li Zhe wrote:
> On Mon, 12 Jan 2026 20:52:12 +0100, david@kernel.org wrote:
> 
>>> As for concern (4), I believe it is orthogonal to this patchset, and
>>> the cover letter already contains a performance comparison that
>>> demonstrates the additional benefit.
>>>
>>>> I did see some comments in [1] about QEMU supporting user-mode
>>>> parallel zero-page operations; I'm just not sure what the current
>>>> state of that support looks like, or what the corresponding benchmark
>>>> numbers are.
>>>
>>> As noted above, QEMU already employs a parallel page-touch mechanism,
>>> yet the elapsed time remains noticeable. I am not deeply familiar with
>>> QEMU; please correct me if I am mistaken.
>>
>> I implemented some part of the parallel preallocation support in QEMU.
>>
>> With QEMU, you can specify the number of threads and even specify the
>> NUMA-placement of these threads. So you can pretty much fine-tune that
>> for an environment.
>>
>> You still pre-zero all hugetlb pages at VM startup time, just in
>> parallel though. So you pay some price at APP startup time.
> 
> Hi David,
> 
> Thank you for the comprehensive explanation.
> 
> You are absolutely correct: QEMU's parallel preallocation is performed
> only during VM start-up. We submitted this patch series mainly
> because we observed that, even with the existing parallel mechanism,
> launching large-size VMs still incurs prohibitive delays. (Bringing up
> a 2 TB VM still requires more than 40 seconds for zeroing)
> 
>> If you know that you will run such a VM (or something else) later, you
>> could pre-zero the memory from user space by using a hugetlb-backed file
>> and supplying that to QEMU as memory backend for the VM. Then, you can
>> start your VM without any pre-zeroing.
>>
>> I guess that approach should work universally. Of course, there are
>> limitations, as you would have to know how much memory an app needs, and
>> have a way to supply that memory in form of a file to that app.
> 
> Regarding user-space pre-zeroing, I agree that it is feasible once the
> VM's memory footprint is known. We evaluated this approach internally;
> however, in production environments, it is almost impossible to predict
> the exact amount of memory a VM will require.

Of course, you could preallocate to the expected maximum and then 
truncate the file to the size you need :)

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 3 days ago
On Tue, 13 Jan 2026 11:15:29 +0100, david@kernel.org wrote:

> On 1/13/26 07:37, Li Zhe wrote:
> > On Mon, 12 Jan 2026 20:52:12 +0100, david@kernel.org wrote:
> > 
> >>> As for concern (4), I believe it is orthogonal to this patchset, and
> >>> the cover letter already contains a performance comparison that
> >>> demonstrates the additional benefit.
> >>>
> >>>> I did see some comments in [1] about QEMU supporting user-mode
> >>>> parallel zero-page operations; I'm just not sure what the current
> >>>> state of that support looks like, or what the corresponding benchmark
> >>>> numbers are.
> >>>
> >>> As noted above, QEMU already employs a parallel page-touch mechanism,
> >>> yet the elapsed time remains noticeable. I am not deeply familiar with
> >>> QEMU; please correct me if I am mistaken.
> >>
> >> I implemented some part of the parallel preallocation support in QEMU.
> >>
> >> With QEMU, you can specify the number of threads and even specify the
> >> NUMA-placement of these threads. So you can pretty much fine-tune that
> >> for an environment.
> >>
> >> You still pre-zero all hugetlb pages at VM startup time, just in
> >> parallel though. So you pay some price at APP startup time.
> > 
> > Hi David,
> > 
> > Thank you for the comprehensive explanation.
> > 
> > You are absolutely correct: QEMU's parallel preallocation is performed
> > only during VM start-up. We submitted this patch series mainly
> > because we observed that, even with the existing parallel mechanism,
> > launching large-size VMs still incurs prohibitive delays. (Bringing up
> > a 2 TB VM still requires more than 40 seconds for zeroing)
> > 
> >> If you know that you will run such a VM (or something else) later, you
> >> could pre-zero the memory from user space by using a hugetlb-backed file
> >> and supplying that to QEMU as memory backend for the VM. Then, you can
> >> start your VM without any pre-zeroing.
> >>
> >> I guess that approach should work universally. Of course, there are
> >> limitations, as you would have to know how much memory an app needs, and
> >> have a way to supply that memory in form of a file to that app.
> > 
> > Regarding user-space pre-zeroing, I agree that it is feasible once the
> > VM's memory footprint is known. We evaluated this approach internally;
> > however, in production environments, it is almost impossible to predict
> > the exact amount of memory a VM will require.
> 
> Of course, you could preallocate to the expected maximum and then 
> truncate the file to the size you need :)

The solution you described seems similar to delegating hugepage
management to a userspace daemon. I haven't explored this approach
before, but it appears quite complex. Beyond ensuring secure memory
isolation between VMs, we would also need to handle scenarios where
the management daemon or the QEMU process crashes, which implies
implementing robust recovery and memory reclamation mechanisms. Do
you happen to have any documentation or references regarding
userspace hugepage management that I could look into? Compared to
the userspace approach, I wonder if implementing hugepage
pre-zeroing directly within the kernel would be a simpler and more
direct way to accelerate VM creation.

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 2 days ago
On 1/13/26 13:41, Li Zhe wrote:
> On Tue, 13 Jan 2026 11:15:29 +0100, david@kernel.org wrote:
> 
>> On 1/13/26 07:37, Li Zhe wrote:
>>> On Mon, 12 Jan 2026 20:52:12 +0100, david@kernel.org wrote:
>>>
>>>>> As for concern (4), I believe it is orthogonal to this patchset, and
>>>>> the cover letter already contains a performance comparison that
>>>>> demonstrates the additional benefit.
>>>>>
>>>>>> I did see some comments in [1] about QEMU supporting user-mode
>>>>>> parallel zero-page operations; I'm just not sure what the current
>>>>>> state of that support looks like, or what the corresponding benchmark
>>>>>> numbers are.
>>>>>
>>>>> As noted above, QEMU already employs a parallel page-touch mechanism,
>>>>> yet the elapsed time remains noticeable. I am not deeply familiar with
>>>>> QEMU; please correct me if I am mistaken.
>>>>
>>>> I implemented some part of the parallel preallocation support in QEMU.
>>>>
>>>> With QEMU, you can specify the number of threads and even specify the
>>>> NUMA-placement of these threads. So you can pretty much fine-tune that
>>>> for an environment.
>>>>
>>>> You still pre-zero all hugetlb pages at VM startup time, just in
>>>> parallel though. So you pay some price at APP startup time.
>>>
>>> Hi David,
>>>
>>> Thank you for the comprehensive explanation.
>>>
>>> You are absolutely correct: QEMU's parallel preallocation is performed
>>> only during VM start-up. We submitted this patch series mainly
>>> because we observed that, even with the existing parallel mechanism,
>>> launching large-size VMs still incurs prohibitive delays. (Bringing up
>>> a 2 TB VM still requires more than 40 seconds for zeroing)
>>>
>>>> If you know that you will run such a VM (or something else) later, you
>>>> could pre-zero the memory from user space by using a hugetlb-backed file
>>>> and supplying that to QEMU as memory backend for the VM. Then, you can
>>>> start your VM without any pre-zeroing.
>>>>
>>>> I guess that approach should work universally. Of course, there are
>>>> limitations, as you would have to know how much memory an app needs, and
>>>> have a way to supply that memory in form of a file to that app.
>>>
>>> Regarding user-space pre-zeroing, I agree that it is feasible once the
>>> VM's memory footprint is known. We evaluated this approach internally;
>>> however, in production environments, it is almost impossible to predict
>>> the exact amount of memory a VM will require.
>>
>> Of course, you could preallocate to the expected maximum and then
>> truncate the file to the size you need :)
> 
> The solution you described seems similar to delegating hugepage
> management to a userspace daemon. I haven't explored this approach
> before, but it appears quite complex. Beyond ensuring secure memory
> isolation between VMs, we would also need to handle scenarios where
> the management daemon or the QEMU process crashes, which implies
> implementing robust recovery and memory reclamation mechanisms. 

Yes, but I don't think that's particularly complicated. You have to 
remove the backing file, yes.

> Do
> you happen to have any documentation or references regarding
> userspace hugepage management that I could look into? 

Not really any documentation. I pretty much only know how QEMU+libvirt 
ends up using it :)

> Compared to
> the userspace approach, I wonder if implementing hugepage
> pre-zeroing directly within the kernel would be a simpler and more
> direct way to accelerate VM creation.

I mean, yes. I don't particularly enjoy user-space having to poll for 
pre-zeroing of pages ... it feels like an odd interface for something 
that is supposed to be simple.

I do understand the reasoning that "zeroing must be charged to 
somebody", and that using a kthread is a bit suboptimal as well.


Here is a thought: with "init_on_free", we charge zeroing of pages to 
whoever frees a page.

Can't we have a hugetlb mode where we zero hugetlb folios as they are 
getting freed back to the hugetlb allocator? IOW, we charge it to 
whoever puts the last reference.

just a thought, maybe it was discussed before ...
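
As a rough, untested illustration of that thought (the
"hugetlb_zero_on_free" toggle and the hook name below are made-up
placeholders, not something in this series or in mainline;
folio_zero_user() is the existing helper for clearing huge folios):

  /*
   * Illustrative sketch only: zero a hugetlb folio when it is returned
   * to the free pool, init_on_free style.
   */
  static bool hugetlb_zero_on_free __read_mostly;
  
  static void hugetlb_prepare_free_folio(struct folio *folio)
  {
  	if (hugetlb_zero_on_free)
  		folio_zero_user(folio, 0);	/* cost charged to the freeing task */
  }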

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 2 days ago
On Wed, 14 Jan 2026 11:41:48 +0100, david@kernel.org wrote:
 
> On 1/13/26 13:41, Li Zhe wrote:
> > On Tue, 13 Jan 2026 11:15:29 +0100, david@kernel.org wrote:
> > 
> >> On 1/13/26 07:37, Li Zhe wrote:
> >>> On Mon, 12 Jan 2026 20:52:12 +0100, david@kernel.org wrote:
> >>>
> >>>>> As for concern (4), I believe it is orthogonal to this patchset, and
> >>>>> the cover letter already contains a performance comparison that
> >>>>> demonstrates the additional benefit.
> >>>>>
> >>>>>> I did see some comments in [1] about QEMU supporting user-mode
> >>>>>> parallel zero-page operations; I'm just not sure what the current
> >>>>>> state of that support looks like, or what the corresponding benchmark
> >>>>>> numbers are.
> >>>>>
> >>>>> As noted above, QEMU already employs a parallel page-touch mechanism,
> >>>>> yet the elapsed time remains noticeable. I am not deeply familiar with
> >>>>> QEMU; please correct me if I am mistaken.
> >>>>
> >>>> I implemented some part of the parallel preallocation support in QEMU.
> >>>>
> >>>> With QEMU, you can specify the number of threads and even specify the
> >>>> NUMA-placement of these threads. So you can pretty much fine-tune that
> >>>> for an environment.
> >>>>
> >>>> You still pre-zero all hugetlb pages at VM startup time, just in
> >>>> parallel though. So you pay some price at APP startup time.
> >>>
> >>> Hi David,
> >>>
> >>> Thank you for the comprehensive explanation.
> >>>
> >>> You are absolutely correct: QEMU's parallel preallocation is performed
> >>> only during VM start-up. We submitted this patch series mainly
> >>> because we observed that, even with the existing parallel mechanism,
> >>> launching large-size VMs still incurs prohibitive delays. (Bringing up
> >>> a 2 TB VM still requires more than 40 seconds for zeroing)
> >>>
> >>>> If you know that you will run such a VM (or something else) later, you
> >>>> could pre-zero the memory from user space by using a hugetlb-backed file
> >>>> and supplying that to QEMU as memory backend for the VM. Then, you can
> >>>> start your VM without any pre-zeroing.
> >>>>
> >>>> I guess that approach should work universally. Of course, there are
> >>>> limitations, as you would have to know how much memory an app needs, and
> >>>> have a way to supply that memory in form of a file to that app.
> >>>
> >>> Regarding user-space pre-zeroing, I agree that it is feasible once the
> >>> VM's memory footprint is known. We evaluated this approach internally;
> >>> however, in production environments, it is almost impossible to predict
> >>> the exact amount of memory a VM will require.
> >>
> >> Of course, you could preallocate to the expected maximum and then
> >> truncate the file to the size you need :)
> > 
> > The solution you described seems similar to delegating hugepage
> > management to a userspace daemon. I haven't explored this approach
> > before, but it appears quite complex. Beyond ensuring secure memory
> > isolation between VMs, we would also need to handle scenarios where
> > the management daemon or the QEMU process crashes, which implies
> > implementing robust recovery and memory reclamation mechanisms. 
> 
> Yes, but I don't think that's particularly complicated. You have to 
> remove the backing file, yes.
> 
> > Do
> > you happen to have any documentation or references regarding
> > userspace hugepage management that I could look into? 
> 
> Not really any documentation. I pretty much only know how QEMU+libvirt 
> ends up using it :)
> 
> > Compared to
> > the userspace approach, I wonder if implementing hugepage
> > pre-zeroing directly within the kernel would be a simpler and more
> > direct way to accelerate VM creation.
> 
> I mean, yes. I don't particularly enjoy user-space having to poll for 
> pre-zeroing of pages ... it feels like an odd interface for something 
> that is supposed to be simple.
> 
> I do understand the reasoning that "zeroing must be charged to 
> somebody", and that using a kthread is a bit suboptimal as well.

My previous explanation may have caused misunderstanding. This
patchset merely exports an interface that allows users to initiate
and halt page zeroing on demand; the CPU cost is borne by the user,
and no kernel thread is introduced.

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 2 days ago
On 1/14/26 12:36, Li Zhe wrote:
> On Wed, 14 Jan 2026 11:41:48 +0100, david@kernel.org wrote:
>   
>> On 1/13/26 13:41, Li Zhe wrote:
>>> On Tue, 13 Jan 2026 11:15:29 +0100, david@kernel.org wrote:
>>>
>>>> On 1/13/26 07:37, Li Zhe wrote:
>>>>> On Mon, 12 Jan 2026 20:52:12 +0100, david@kernel.org wrote:
>>>>>
>>>>>>> As for concern (4), I believe it is orthogonal to this patchset, and
>>>>>>> the cover letter already contains a performance comparison that
>>>>>>> demonstrates the additional benefit.
>>>>>>>
>>>>>>>> I did see some comments in [1] about QEMU supporting user-mode
>>>>>>>> parallel zero-page operations; I'm just not sure what the current
>>>>>>>> state of that support looks like, or what the corresponding benchmark
>>>>>>>> numbers are.
>>>>>>>
>>>>>>> As noted above, QEMU already employs a parallel page-touch mechanism,
>>>>>>> yet the elapsed time remains noticeable. I am not deeply familiar with
>>>>>>> QEMU; please correct me if I am mistaken.
>>>>>>
>>>>>> I implemented some part of the parallel preallocation support in QEMU.
>>>>>>
>>>>>> With QEMU, you can specify the number of threads and even specify the
>>>>>> NUMA-placement of these threads. So you can pretty much fine-tune that
>>>>>> for an environment.
>>>>>>
>>>>>> You still pre-zero all hugetlb pages at VM startup time, just in
>>>>>> parallel though. So you pay some price at APP startup time.
>>>>>
>>>>> Hi David,
>>>>>
>>>>> Thank you for the comprehensive explanation.
>>>>>
>>>>> You are absolutely correct: QEMU's parallel preallocation is performed
>>>>> only during VM start-up. We submitted this patch series mainly
>>>>> because we observed that, even with the existing parallel mechanism,
>>>>> launching large-size VMs still incurs prohibitive delays. (Bringing up
>>>>> a 2 TB VM still requires more than 40 seconds for zeroing)
>>>>>
>>>>>> If you know that you will run such a VM (or something else) later, you
>>>>>> could pre-zero the memory from user space by using a hugetlb-backed file
>>>>>> and supplying that to QEMU as memory backend for the VM. Then, you can
>>>>>> start your VM without any pre-zeroing.
>>>>>>
>>>>>> I guess that approach should work universally. Of course, there are
>>>>>> limitations, as you would have to know how much memory an app needs, and
>>>>>> have a way to supply that memory in form of a file to that app.
>>>>>
>>>>> Regarding user-space pre-zeroing, I agree that it is feasible once the
>>>>> VM's memory footprint is known. We evaluated this approach internally;
>>>>> however, in production environments, it is almost impossible to predict
>>>>> the exact amount of memory a VM will require.
>>>>
>>>> Of course, you could preallocate to the expected maximum and then
>>>> truncate the file to the size you need :)
>>>
>>> The solution you described seems similar to delegating hugepage
>>> management to a userspace daemon. I haven't explored this approach
>>> before, but it appears quite complex. Beyond ensuring secure memory
>>> isolation between VMs, we would also need to handle scenarios where
>>> the management daemon or the QEMU process crashes, which implies
>>> implementing robust recovery and memory reclamation mechanisms.
>>
>> Yes, but I don't think that's particularly complicated. You have to
>> remove the backing file, yes.
>>
>>> Do
>>> you happen to have any documentation or references regarding
>>> userspace hugepage management that I could look into?
>>
>> Not really any documentation. I pretty much only know how QEMU+libvirt
>> ends up using it :)
>>
>>> Compared to
>>> the userspace approach, I wonder if implementing hugepage
>>> pre-zeroing directly within the kernel would be a simpler and more
>>> direct way to accelerate VM creation.
>>
>> I mean, yes. I don't particularly enjoy user-space having to poll for
>> pre-zeroing of pages ... it feels like an odd interface for something
>> that is supposed to be simple.
>>
>> I do understand the reasoning that "zeroing must be charged to
>> somebody", and that using a kthread is a bit suboptimal as well.
> 
> My previous explanation may have caused misunderstanding. This
> patchset merely exports an interface that allows users to initiate
> and halt page zeroing on demand; the CPU cost is borne by the user,
> and no kernel thread is introduced.

You said "I wonder if implementing hugepage pre-zeroing directly within 
the kernel would be a simpler and more direct way to accelerate VM 
creation".

And I agree. But to make that fly (no user space polling interface), I 
was wondering whether we could do it like "init_on_free" and let whoever 
frees a hugetlb folio just reinitialize it with 0.

No kernel thread, no user space thread involved.

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Mateusz Guzik 3 weeks, 2 days ago
On Wed, Jan 14, 2026 at 12:55 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
> You said "I wonder if implementing hugepage pre-zeroing directly within
> the kernel would be a simpler and more direct way to accelerate VM
> creation".
>
> And I agree. But to make that fly (no user space polling interface), I
> was wondering whether we could do it like "init_on_free" and let whoever
> frees a hugetlb folio just reinitialize it with 0.
>
> No kernel thread, no user space thread involved.
>

I don't see how this is supposed to address the stated problem of
zeroing being incredibly expensive.

With machinery to pre-zero, and depending on the availability of CPU
time plus pages eligible for allocation but not yet zeroed versus the
frequency of VM startups/teardowns, there is some amount of real time
which won't be spent waiting on said zeroing because it was already done.

Any approach which keeps the overhead with the program allocating the
page can't take advantage of it, even if said overhead is paid at the
end of its life.
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 2 days ago
On 1/14/26 13:11, Mateusz Guzik wrote:
> On Wed, Jan 14, 2026 at 12:55 PM David Hildenbrand (Red Hat)
> <david@kernel.org> wrote:
>> You said "I wonder if implementing hugepage pre-zeroing directly within
>> the kernel would be a simpler and more direct way to accelerate VM
>> creation".
>>
>> And I agree. But to make that fly (no user space polling interface), I
>> was wondering whether we could do it like "init_on_free" and let whoever
>> frees a hugetlb folio just reinitialize it with 0.
>>
>> No kernel thread, no user space thread involved.
>>
> 
> i don't see how this is supposed to address the stated problem of
> zeroing being incredibly expensive.

The price of zeroing has to be paid somewhere.

Currently it's done at allocation time, we could move it to freeing time.

That would make application startup faster and application shutdown slower.

And we're aware that application shutdown can be expensive, which is why 
e.g., QEMU implements an async shutdown operation, where the MM gets 
torn down from another process.

> 
> With machinery to pre-zero and depending on availability of CPU time +
> pages eligible for allocation but not yet zeroed vs vm
> startups/teardowns frequency, there is some amount of real time which
> wont be spent waiting on said zeroing because it was already done.
> 
> Any approach which keeps the overhead with the program allocating the
> page can't take advantage of it, even if said overhead is paid at the
> end of its life.

Let's read again at the main use case of this change here is, as stated:

"... there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by these
pages) starts. For 256 1G pages and 40ms per page, this would take
10 seconds, a noticeable delay."

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 2 days ago
On 1/14/26 13:33, David Hildenbrand (Red Hat) wrote:
> On 1/14/26 13:11, Mateusz Guzik wrote:
>> On Wed, Jan 14, 2026 at 12:55 PM David Hildenbrand (Red Hat)
>> <david@kernel.org> wrote:
>>> You said "I wonder if implementing hugepage pre-zeroing directly within
>>> the kernel would be a simpler and more direct way to accelerate VM
>>> creation".
>>>
>>> And I agree. But to make that fly (no user space polling interface), I
>>> was wondering whether we could do it like "init_on_free" and let whoever
>>> frees a hugetlb folio just reinitialize it with 0.
>>>
>>> No kernel thread, no user space thread involved.
>>>
>>
>> i don't see how this is supposed to address the stated problem of
>> zeroing being incredibly expensive.
> 
> The price of zeroing has to be paid somewhere.
> 
> Currently it's done at allocation time, we could move it to freeing time.
> 
> That would make application startup faster and application shutdown slower.
> 
> And we're aware that application shutdown can be expensive, which is why
> e.g., QEMU implements an async shutdown operation, where the MM gets
> torn down from another process.

Also, just to mention it, assuming a VM is backed by a hugetlb file, the 
user space thread destroying that file (or parts of it by punching holes 
and freeing hugetlb folios) would be paying that price.

That could be done whenever there is a CPU to spare to perform some freeing.

But again, I think the main motivation here is "increase application 
startup", not making the zeroing happen at specific points in time 
during system operation (e.g., when idle, etc.).

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Mateusz Guzik 3 weeks, 2 days ago
On Wed, Jan 14, 2026 at 1:41 PM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> On 1/14/26 13:33, David Hildenbrand (Red Hat) wrote:
> > On 1/14/26 13:11, Mateusz Guzik wrote:
> >> On Wed, Jan 14, 2026 at 12:55 PM David Hildenbrand (Red Hat)
> >> <david@kernel.org> wrote:
> >>> You said "I wonder if implementing hugepage pre-zeroing directly within
> >>> the kernel would be a simpler and more direct way to accelerate VM
> >>> creation".
> >>>
> >>> And I agree. But to make that fly (no user space polling interface), I
> >>> was wondering whether we could do it like "init_on_free" and let whoever
> >>> frees a hugetlb folio just reinitialize it with 0.
> >>>
> >>> No kernel thread, no user space thread involved.
> >>>
> >>
> >> i don't see how this is supposed to address the stated problem of
> >> zeroing being incredibly expensive.
> >
> > The price of zeroing has to be paid somewhere.
> >

Of course.

I'm stating that with dedicated threads for zeroing, provided there is
some memory available along with cpu time, it can be paid while
nothing actively needs these pages.

> > Currently it's done at allocation time, we could move it to freeing time.
> >
> > That would make application startup faster and application shutdown slower.
> >
> > And we're aware that application shutdown can be expensive, which is why
> > e.g., QEMU implements an async shutdown operation, where the MM gets
> > torn down from another process.
>
> Also, just to mention it, assuming a VM is backed by a hugetlb file, the
> user space thread destroying that file (or parts of it by punshing holes
> and freeing hugetlb folios) would be paying that price.
>
> That could be done whenever there is a CPU to spare to perform some freeing.
>
> But again, I think the main motivation here is "increase application
> startup", not optimize that the zeroing happens at specific points in
> time during system operation (e.g., when idle etc).
>

Framing this as "increase application startup" and merely shifting the
overhead to shutdown seems like gaming the problem statement to me.
The real problem is total real time spent on it while pages are
needed.

Support for background zeroing can give you more usable pages provided
it has the cpu + ram to do it. If it does not, you are in the worst
case in the same spot as with zeroing on free.

Let's take a look at some examples.

Say there are no free huge pages and you kill a VM + start a new one.
On top of that, all CPUs are pegged as is. In this case the total time
is the same for "zero on free" as it is for background zeroing.

Say the system is freshly booted and you start up a VM. There are no
pre-zeroed pages available, so it suffers at start time no matter what.
However, with some support for background zeroing, the machinery could
respond to demand and do it in parallel in some capacity, shortening
the real time needed.

Say a little bit of real time passes and you start another VM. With
mere zeroing on free there are still no pre-zeroed pages available,
so it again suffers the overhead. With background zeroing, some of
that memory would already be sorted out, speeding up said startup.
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 2 days ago
>> But again, I think the main motivation here is "increase application
>> startup", not optimize that the zeroing happens at specific points in
>> time during system operation (e.g., when idle etc).
>>
> 
> Framing this as "increase application startup" and merely shifting the
> overhead to shutdown seems like gaming the problem statement to me.
> The real problem is total real time spent on it while pages are
> needed.
> 
> Support for background zeroing can give you more usable pages provided
> it has the cpu + ram to do it. If it does not, you are in the worst
> case in the same spot as with zeroing on free.
> 
> Let's take a look at some examples.
> 
> Say there are no free huge pages and you kill a vm + start a new one.
> On top of that all CPUs are pegged as is. In this case total time is
> the same for "zero on free" as it is for background zeroing.

Right. If the pages get freed to immediately get allocated again, it 
doesn't really matter who does the freeing. There might be some details, 
of course.

> 
> Say the system is freshly booted and you start up a vm. There are no
> pre-zeroed pages available so it suffers at start time no matter what.
> However, with some support for background zeroing, the machinery could
> respond to demand and do it in parallel in some capacity, shortening
> the real time needed.

Just like for init_on_free, I would start with zeroing these pages 
during boot.

init_on_free assures that all pages in the buddy were zeroed out. Which 
greatly simplifies the implementation, because there is no need to track 
what was initialized and what was not.

It's a good question whether that initialization should be done in 
parallel, possibly asynchronously, during boot. Reminds me a bit of 
deferred page initialization during boot. But that is rather an 
extension that could be added somewhat transparently on top later.

If ever required we could dynamically enable this setting for a running 
system. Whoever would enable it (flips the magic toggle) would zero out 
all hugetlb pages that are already in the hugetlb allocator as free, but 
not initialized yet.

But again, these are extensions on top of the basic design of having all 
free hugetlb folios be zeroed.

> 
> Say a little bit of real time passes and you start another vm. With
> merely zeroing on free there are still no pre-zeroed pages available
> so it again suffers the overhead. With background zeroing some of the
> that memory would be already sorted out, speeding up said startup.

The moment they end up in the hugetlb allocator as free folios they 
would have to get initialized.

Now, I am sure there are downsides to this approach (e.g., how to speed 
up process exit by parallelizing zeroing, if ever required). But it 
sounds like it would be a bit ... simpler, without user space changes 
required. In theory :)

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 1 day ago
On Wed, 14 Jan 2026 18:21:08 +0100, david@kernel.org wrote:
  
> >> But again, I think the main motivation here is "increase application
> >> startup", not optimize that the zeroing happens at specific points in
> >> time during system operation (e.g., when idle etc).
> >>
> > 
> > Framing this as "increase application startup" and merely shifting the
> > overhead to shutdown seems like gaming the problem statement to me.
> > The real problem is total real time spent on it while pages are
> > needed.
> > 
> > Support for background zeroing can give you more usable pages provided
> > it has the cpu + ram to do it. If it does not, you are in the worst
> > case in the same spot as with zeroing on free.
> > 
> > Let's take a look at some examples.
> > 
> > Say there are no free huge pages and you kill a vm + start a new one.
> > On top of that all CPUs are pegged as is. In this case total time is
> > the same for "zero on free" as it is for background zeroing.
> 
> Right. If the pages get freed to immediately get allocated again, it 
> doesn't really matter who does the freeing. There might be some details, 
> of course.
> 
> > 
> > Say the system is freshly booted and you start up a vm. There are no
> > pre-zeroed pages available so it suffers at start time no matter what.
> > However, with some support for background zeroing, the machinery could
> > respond to demand and do it in parallel in some capacity, shortening
> > the real time needed.
> 
> Just like for init_on_free, I would start with zeroing these pages 
> during boot.
> 
> init_on_free assures that all pages in the buddy were zeroed out. Which 
> greatly simplifies the implementation, because there is no need to track 
> what was initialized and what was not.
> 
> It's a good question if initialization during that should be done in 
> parallel, possibly asynchronously during boot. Reminds me a bit of 
> deferred page initialization during boot. But that is rather an 
> extension that could be added somewhat transparently on top later.
> 
> If ever required we could dynamically enable this setting for a running 
> system. Whoever would enable it (flips the magic toggle) would zero out 
> all hugetlb pages that are already in the hugetlb allocator as free, but 
> not initialized yet.
> 
> But again, these are extensions on top of the basic design of having all 
> free hugetlb folios be zeroed.
> 
> > 
> > Say a little bit of real time passes and you start another vm. With
> > merely zeroing on free there are still no pre-zeroed pages available
> > so it again suffers the overhead. With background zeroing some of the
> > that memory would be already sorted out, speeding up said startup.
> 
> The moment they end up in the hugetlb allocator as free folios they 
> would have to get initialized.
> 
> Now, I am sure there are downsides to this approach (how to speedup 
> process exit by parallelizing zeroing, if ever required)? But it sounds 
> like being a bit ... simpler without user space changes required. In 
> theory :)

I strongly agree that the init_on_free strategy effectively eliminates the
latency incurred during VM creation. However, it appears to introduce
two new issues.

First, the process that later allocates a page may not be the one that
freed it, raising the question of which process should bear the cost
of zeroing.

Second, put_page() is executed atomically, making it inappropriate to
invoke clear_page() within that context; off-loading the zeroing to a
workqueue merely reopens the same accounting problem.

Do you have any recommendations regarding these issues?

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 1 day ago
On 1/15/26 10:36, Li Zhe wrote:
> On Wed, 14 Jan 2026 18:21:08 +0100, david@kernel.org wrote:
>    
>>>> But again, I think the main motivation here is "increase application
>>>> startup", not optimize that the zeroing happens at specific points in
>>>> time during system operation (e.g., when idle etc).
>>>>
>>>
>>> Framing this as "increase application startup" and merely shifting the
>>> overhead to shutdown seems like gaming the problem statement to me.
>>> The real problem is total real time spent on it while pages are
>>> needed.
>>>
>>> Support for background zeroing can give you more usable pages provided
>>> it has the cpu + ram to do it. If it does not, you are in the worst
>>> case in the same spot as with zeroing on free.
>>>
>>> Let's take a look at some examples.
>>>
>>> Say there are no free huge pages and you kill a vm + start a new one.
>>> On top of that all CPUs are pegged as is. In this case total time is
>>> the same for "zero on free" as it is for background zeroing.
>>
>> Right. If the pages get freed to immediately get allocated again, it
>> doesn't really matter who does the freeing. There might be some details,
>> of course.
>>
>>>
>>> Say the system is freshly booted and you start up a vm. There are no
>>> pre-zeroed pages available so it suffers at start time no matter what.
>>> However, with some support for background zeroing, the machinery could
>>> respond to demand and do it in parallel in some capacity, shortening
>>> the real time needed.
>>
>> Just like for init_on_free, I would start with zeroing these pages
>> during boot.
>>
>> init_on_free assures that all pages in the buddy were zeroed out. Which
>> greatly simplifies the implementation, because there is no need to track
>> what was initialized and what was not.
>>
>> It's a good question if initialization during that should be done in
>> parallel, possibly asynchronously during boot. Reminds me a bit of
>> deferred page initialization during boot. But that is rather an
>> extension that could be added somewhat transparently on top later.
>>
>> If ever required we could dynamically enable this setting for a running
>> system. Whoever would enable it (flips the magic toggle) would zero out
>> all hugetlb pages that are already in the hugetlb allocator as free, but
>> not initialized yet.
>>
>> But again, these are extensions on top of the basic design of having all
>> free hugetlb folios be zeroed.
>>
>>>
>>> Say a little bit of real time passes and you start another vm. With
>>> merely zeroing on free there are still no pre-zeroed pages available
>>> so it again suffers the overhead. With background zeroing some of the
>>> that memory would be already sorted out, speeding up said startup.
>>
>> The moment they end up in the hugetlb allocator as free folios they
>> would have to get initialized.
>>
>> Now, I am sure there are downsides to this approach (how to speedup
>> process exit by parallelizing zeroing, if ever required)? But it sounds
>> like being a bit ... simpler without user space changes required. In
>> theory :)
> 
> I strongly agree that init_on_free strategy effectively eliminates the
> latency incurred during VM creation. However, it appears to introduce
> two new issues.
> 
> First, the process that later allocates a page may not be the one that
> freed it, raising the question of which process should bear the cost
> of zeroing.

Right now the cost is paid by the process that allocates a page. If you
shift that to the freeing path, it's still the same process, just at a
different point in time.

Of course, there are exceptions to that: if you have a hugetlb file that
is shared by multiple processes (-> the process that essentially truncates
the file pays), or if someone (GUP-pin) holds a reference to a folio even
after the file was truncated (not common but possible).

With CoW it would be the process that last unmaps the folio. CoW with
hugetlb is fortunately something that is rare (and rather shaky :) ).

> 
> Second, put_page() is executed atomically, making it inappropriate to
> invoke clear_page() within that context; off-loading the zeroing to a
> workqueue merely reopens the same accounting problem.

I thought about this as well. For init_on_free we always invoke it for
up to 4MiB folios during put_page() on x86-64.

See __folio_put()->__free_frozen_pages()->free_pages_prepare()

Where we call kernel_init_pages(page, 1 << order);

So surely, for 2 MiB folios (hugetlb) this is not a problem.

... but then, on arm64 with 64k base pages we have 512 MiB folios
(managed by the buddy!) where this is apparently not a problem? Or is
it and should be fixed?

So I would expect that once we go up to 1 GiB, we might only reveal more
areas where we should have optimized in the first place by dropping
the reference outside the spin lock ... and these optimizations would
obviously (unless in hugetlb-specific code ...) benefit init_on_free
setups as well (and page poisoning).


Looking at __unmap_hugepage_range(), for example, we already make sure
to not drop the reference while holding the PTL (spinlock).

In general, I think when using MMU gather we drop folio references outside
the PTL, because we know that doing it under the lock can hurt performance badly.

I documented some of the nasty things that can happen with MMU gather in

commit e61abd4490684de379b4a2ef1be2dbde39ac1ced
Author: David Hildenbrand <david@kernel.org>
Date:   Wed Feb 14 21:44:34 2024 +0100

     mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing
     
     In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now
     up to 256 folio fragments that span more than one page, before we
     conditionally reschedule.
     
     It's a pain that we have to handle cond_resched() in
     tlb_batch_pages_flush() manually and cannot simply handle it in
     release_pages() -- release_pages() can be called from atomic context.
     Well, in a perfect world we wouldn't have to make our code more
     complicated at all.
     
     With page poisoning and init_on_free, we might now run into soft lockups
     when we free a lot of rather large folio fragments, because page freeing
     time then depends on the actual memory size we are freeing instead of on
     the number of folios that are involved.
     
     In the absolute (unlikely) worst case, on arm64 with 64k we will be able
     to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
     GiB does sound like it might take a while.  But instead of ignoring this
     unlikely case, let's just handle it.


But more generally, when dealing with the PTL we try to put folio references outside
the lock (there are some cases in mm/memory.c where we apparently don't do it yet),
because freeing memory can take a while.

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Jonathan Cameron 3 weeks, 1 day ago
On Thu, 15 Jan 2026 12:08:03 +0100
"David Hildenbrand (Red Hat)" <david@kernel.org> wrote:

> On 1/15/26 10:36, Li Zhe wrote:
> > On Wed, 14 Jan 2026 18:21:08 +0100, david@kernel.org wrote:
> >      
> >>>> But again, I think the main motivation here is "increase application
> >>>> startup", not optimize that the zeroing happens at specific points in
> >>>> time during system operation (e.g., when idle etc).
> >>>>  
> >>>
> >>> Framing this as "increase application startup" and merely shifting the
> >>> overhead to shutdown seems like gaming the problem statement to me.
> >>> The real problem is total real time spent on it while pages are
> >>> needed.
> >>>
> >>> Support for background zeroing can give you more usable pages provided
> >>> it has the cpu + ram to do it. If it does not, you are in the worst
> >>> case in the same spot as with zeroing on free.
> >>>
> >>> Let's take a look at some examples.
> >>>
> >>> Say there are no free huge pages and you kill a vm + start a new one.
> >>> On top of that all CPUs are pegged as is. In this case total time is
> >>> the same for "zero on free" as it is for background zeroing.  
> >>
> >> Right. If the pages get freed to immediately get allocated again, it
> >> doesn't really matter who does the freeing. There might be some details,
> >> of course.
> >>  
> >>>
> >>> Say the system is freshly booted and you start up a vm. There are no
> >>> pre-zeroed pages available so it suffers at start time no matter what.
> >>> However, with some support for background zeroing, the machinery could
> >>> respond to demand and do it in parallel in some capacity, shortening
> >>> the real time needed.  
> >>
> >> Just like for init_on_free, I would start with zeroing these pages
> >> during boot.
> >>
> >> init_on_free assures that all pages in the buddy were zeroed out. Which
> >> greatly simplifies the implementation, because there is no need to track
> >> what was initialized and what was not.
> >>
> >> It's a good question if initialization during that should be done in
> >> parallel, possibly asynchronously during boot. Reminds me a bit of
> >> deferred page initialization during boot. But that is rather an
> >> extension that could be added somewhat transparently on top later.
> >>
> >> If ever required we could dynamically enable this setting for a running
> >> system. Whoever would enable it (flips the magic toggle) would zero out
> >> all hugetlb pages that are already in the hugetlb allocator as free, but
> >> not initialized yet.
> >>
> >> But again, these are extensions on top of the basic design of having all
> >> free hugetlb folios be zeroed.
> >>  
> >>>
> >>> Say a little bit of real time passes and you start another vm. With
> >>> merely zeroing on free there are still no pre-zeroed pages available
> >>> so it again suffers the overhead. With background zeroing some of the
> >>> that memory would be already sorted out, speeding up said startup.  
> >>
> >> The moment they end up in the hugetlb allocator as free folios they
> >> would have to get initialized.
> >>
> >> Now, I am sure there are downsides to this approach (how to speedup
> >> process exit by parallelizing zeroing, if ever required)? But it sounds
> >> like being a bit ... simpler without user space changes required. In
> >> theory :)  
> > 
> > I strongly agree that init_on_free strategy effectively eliminates the
> > latency incurred during VM creation. However, it appears to introduce
> > two new issues.
> > 
> > First, the process that later allocates a page may not be the one that
> > freed it, raising the question of which process should bear the cost
> > of zeroing.  
> 
> Right now the cost is payed by the process that allocates a page. If you
> shift that to the freeing path, it's still the same process, just at a
> different point in time.
> 
> Of course, there are exceptions to that: if you have a hugetlb file that
> is shared by multiple processes (-> process that essentially truncates
> the file). Or if someone (GUP-pin) holds a reference to a file even after
> it was truncated (not common but possible).
> 
> With CoW it would be the process that last unmaps the folio. CoW with
> hugetlb is fortunately something that is rare (and rather shaky :) ).
> 
> > 
> > Second, put_page() is executed atomically, making it inappropriate to
> > invoke clear_page() within that context; off-loading the zeroing to a
> > workqueue merely reopens the same accounting problem.  
> 
> I thought about this as well. For init_on_free we always invoke it for
> up to 4MiB folios during put_page() on x86-64.
> 
> See __folio_put()->__free_frozen_pages()->free_pages_prepare()
> 
> Where we call kernel_init_pages(page, 1 << order);
> 
> So surely, for 2 MiB folios (hugetlb) this is not a problem.
> 
> ... but then, on arm64 with 64k base pages we have 512 MiB folios
> (managed by the buddy!) where this is apparently not a problem? Or is
> it and should be fixed?
> 
> So I would expect once we go up to 1 GiB, we might only reveal more
> areas where we should have optimized in the first case by dropping
> the reference outside the spin lock ... and these optimizations would
> obviously (unless in hugetlb specific code ...) benefit init_on_free
> setups as well (and page poisoning).

FWIW I'd be interested in seeing if we can do the zeroing async and allow
for hardware offloading. If it happens to be in CXL (and someone
built the fancy bits) we can ask the device to zero ranges of memory
for us.  If they built the HDM-DB stuff it's coherent too (came up
in Davidlohr's LPC Device-mem talk on HDM-DB + back invalidate
support).
+CC linux-cxl and Davidlohr + a few others.

More locally this sounds like fun for DMA engines, though they are going
to rapidly eat bandwidth up and so we'll need QoS stuff in place
to stop them perturbing other workloads.

Give me a list of 1Gig pages and this stuff becomes much more efficient
than anything the CPU can do.

Jonathan

> 
> 
> Looking at __unmap_hugepage_range(), for example, we already make sure
> to not drop the reference while holding the PTL (spinlock).
> 
> In general, I think when using MMU gather we drop folio references out
> of the PTL, because we know that it can hurt performance badly.
> 
> I documented some of the nasty things that can happen with MMU gather in
> 
> commit e61abd4490684de379b4a2ef1be2dbde39ac1ced
> Author: David Hildenbrand <david@kernel.org>
> Date:   Wed Feb 14 21:44:34 2024 +0100
> 
>      mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing
>      
>      In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now
>      up to 256 folio fragments that span more than one page, before we
>      conditionally reschedule.
>      
>      It's a pain that we have to handle cond_resched() in
>      tlb_batch_pages_flush() manually and cannot simply handle it in
>      release_pages() -- release_pages() can be called from atomic context.
>      Well, in a perfect world we wouldn't have to make our code more
>      complicated at all.
>      
>      With page poisoning and init_on_free, we might now run into soft lockups
>      when we free a lot of rather large folio fragments, because page freeing
>      time then depends on the actual memory size we are freeing instead of on
>      the number of folios that are involved.
>      
>      In the absolute (unlikely) worst case, on arm64 with 64k we will be able
>      to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
>      GiB does sound like it might take a while.  But instead of ignoring this
>      unlikely case, let's just handle it.
> 
> 
> But more general, when dealing with the PTL we try to put folio references outside
> the lock (there are some cases in mm/memory.c where we apparently don't do it yet),
> because freeing memory can take a while.
>
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 1 day ago
On 1/15/26 12:57, Jonathan Cameron wrote:
> On Thu, 15 Jan 2026 12:08:03 +0100
> "David Hildenbrand (Red Hat)" <david@kernel.org> wrote:
> 
>> On 1/15/26 10:36, Li Zhe wrote:
>>> On Wed, 14 Jan 2026 18:21:08 +0100, david@kernel.org wrote:
>>>       
>>>>>> But again, I think the main motivation here is "increase application
>>>>>> startup", not optimize that the zeroing happens at specific points in
>>>>>> time during system operation (e.g., when idle etc).
>>>>>>   
>>>>>
>>>>> Framing this as "increase application startup" and merely shifting the
>>>>> overhead to shutdown seems like gaming the problem statement to me.
>>>>> The real problem is total real time spent on it while pages are
>>>>> needed.
>>>>>
>>>>> Support for background zeroing can give you more usable pages provided
>>>>> it has the cpu + ram to do it. If it does not, you are in the worst
>>>>> case in the same spot as with zeroing on free.
>>>>>
>>>>> Let's take a look at some examples.
>>>>>
>>>>> Say there are no free huge pages and you kill a vm + start a new one.
>>>>> On top of that all CPUs are pegged as is. In this case total time is
>>>>> the same for "zero on free" as it is for background zeroing.
>>>>
>>>> Right. If the pages get freed to immediately get allocated again, it
>>>> doesn't really matter who does the freeing. There might be some details,
>>>> of course.
>>>>   
>>>>>
>>>>> Say the system is freshly booted and you start up a vm. There are no
>>>>> pre-zeroed pages available so it suffers at start time no matter what.
>>>>> However, with some support for background zeroing, the machinery could
>>>>> respond to demand and do it in parallel in some capacity, shortening
>>>>> the real time needed.
>>>>
>>>> Just like for init_on_free, I would start with zeroing these pages
>>>> during boot.
>>>>
>>>> init_on_free assures that all pages in the buddy were zeroed out. Which
>>>> greatly simplifies the implementation, because there is no need to track
>>>> what was initialized and what was not.
>>>>
>>>> It's a good question if initialization during that should be done in
>>>> parallel, possibly asynchronously during boot. Reminds me a bit of
>>>> deferred page initialization during boot. But that is rather an
>>>> extension that could be added somewhat transparently on top later.
>>>>
>>>> If ever required we could dynamically enable this setting for a running
>>>> system. Whoever would enable it (flips the magic toggle) would zero out
>>>> all hugetlb pages that are already in the hugetlb allocator as free, but
>>>> not initialized yet.
>>>>
>>>> But again, these are extensions on top of the basic design of having all
>>>> free hugetlb folios be zeroed.
>>>>   
>>>>>
>>>>> Say a little bit of real time passes and you start another vm. With
>>>>> merely zeroing on free there are still no pre-zeroed pages available
>>>>> so it again suffers the overhead. With background zeroing some of the
>>>>> that memory would be already sorted out, speeding up said startup.
>>>>
>>>> The moment they end up in the hugetlb allocator as free folios they
>>>> would have to get initialized.
>>>>
>>>> Now, I am sure there are downsides to this approach (how to speedup
>>>> process exit by parallelizing zeroing, if ever required)? But it sounds
>>>> like being a bit ... simpler without user space changes required. In
>>>> theory :)
>>>
>>> I strongly agree that init_on_free strategy effectively eliminates the
>>> latency incurred during VM creation. However, it appears to introduce
>>> two new issues.
>>>
>>> First, the process that later allocates a page may not be the one that
>>> freed it, raising the question of which process should bear the cost
>>> of zeroing.
>>
>> Right now the cost is payed by the process that allocates a page. If you
>> shift that to the freeing path, it's still the same process, just at a
>> different point in time.
>>
>> Of course, there are exceptions to that: if you have a hugetlb file that
>> is shared by multiple processes (-> process that essentially truncates
>> the file). Or if someone (GUP-pin) holds a reference to a file even after
>> it was truncated (not common but possible).
>>
>> With CoW it would be the process that last unmaps the folio. CoW with
>> hugetlb is fortunately something that is rare (and rather shaky :) ).
>>
>>>
>>> Second, put_page() is executed atomically, making it inappropriate to
>>> invoke clear_page() within that context; off-loading the zeroing to a
>>> workqueue merely reopens the same accounting problem.
>>
>> I thought about this as well. For init_on_free we always invoke it for
>> up to 4MiB folios during put_page() on x86-64.
>>
>> See __folio_put()->__free_frozen_pages()->free_pages_prepare()
>>
>> Where we call kernel_init_pages(page, 1 << order);
>>
>> So surely, for 2 MiB folios (hugetlb) this is not a problem.
>>
>> ... but then, on arm64 with 64k base pages we have 512 MiB folios
>> (managed by the buddy!) where this is apparently not a problem? Or is
>> it and should be fixed?
>>
>> So I would expect once we go up to 1 GiB, we might only reveal more
>> areas where we should have optimized in the first case by dropping
>> the reference outside the spin lock ... and these optimizations would
>> obviously (unless in hugetlb specific code ...) benefit init_on_free
>> setups as well (and page poisoning).
> 
> FWIW I'd be interesting in seeing if we can do the zeroing async and allow
> for hardware offloading. If it happens to be in CXL (and someone
> built the fancy bits) we can ask the device to zero ranges of memory
> for us.  If they built the HDM-DB stuff it's coherent too (came up
> in the Davidlohr's LPC Device-mem talk on HDM-DB + back invalidate
> support)
> +CC linux-cxl and Davidlohr + a few others.
> 
> More locally this sounds like fun for DMA engines, though they are going
> to rapidly eat bandwidth up and so we'll need QoS stuff in place
> to stop them perturbing other workloads.
> 
> Give me a list of 1Gig pages and this stuff becomes much more efficient
> than anything the CPU can do.

Right, and ideally we'd implement any such mechanisms in a way that more 
parts of the kernel can benefit, and not just an unloved in-memory 
file-system that most people just want to get rid of as soon as we can :)

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by dan.j.williams@intel.com 3 weeks, 1 day ago
David Hildenbrand (Red Hat) wrote:
[..]
> > Give me a list of 1Gig pages and this stuff becomes much more efficient
> > than anything the CPU can do.
> 
> Right, and ideally we'd implement any such mechanisms in a way that more 
> parts of the kernel can benefit, and not just an unloved in-memory 
> file-system that most people just want to get rid of as soon as we can :)

CPUs have tended to eat the value of simple DMA offload operations like
copy/zero over time.

In the case of this patch there is no async-offload benefit because
userspace is already charged with spawning more threads if it wants more
parallelism.

For sync-offload one engine may be able to beat a single CPU, but now
you have created bandwidth contention problems with a component that is
less responsive to the scheduler.

Call me skeptical.

Signed, someone with async_tx and dmaengine battle scars.
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 3 weeks, 1 day ago
On 1/15/26 21:16, dan.j.williams@intel.com wrote:
> David Hildenbrand (Red Hat) wrote:
> [..]
>>> Give me a list of 1Gig pages and this stuff becomes much more efficient
>>> than anything the CPU can do.
>>
>> Right, and ideally we'd implement any such mechanisms in a way that more
>> parts of the kernel can benefit, and not just an unloved in-memory
>> file-system that most people just want to get rid of as soon as we can :)
> 
> CPUs have tended to eat the value of simple DMA offload operations like
> copy/zero over time.
> 
> In the case of this patch there is no async-offload benefit because
> userspace is already charged with spawning more threads if it wants more
> parallelism.

In this subthread we're discussing handling that in the kernel like 
init_on_free. So when user space frees a hugetlb folio (or in the 
future, other similarly gigantic folios from another allocator), we'd be 
zeroing it.

If it were freeing multiple such folios, we could batch them and send 
them to a DMA engine to zero them for us (concurrently? asynchronously? 
I don't know :) )

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Ankur Arora 3 weeks, 1 day ago
David Hildenbrand (Red Hat) <david@kernel.org> writes:

> On 1/15/26 21:16, dan.j.williams@intel.com wrote:
>> David Hildenbrand (Red Hat) wrote:
>> [..]
>>>> Give me a list of 1Gig pages and this stuff becomes much more efficient
>>>> than anything the CPU can do.
>>>
>>> Right, and ideally we'd implement any such mechanisms in a way that more
>>> parts of the kernel can benefit, and not just an unloved in-memory
>>> file-system that most people just want to get rid of as soon as we can :)
>> CPUs have tended to eat the value of simple DMA offload operations like
>> copy/zero over time.
>> In the case of this patch there is no async-offload benefit because
>> userspace is already charged with spawning more threads if it wants more
>> parallelism.
>
> In this subthread we're discussing handling that in the kernel like
> init_on_free. So when user space frees a hugetlb folio (or in the 
> future, other similarly gigantic folios from another allocator), we'd be zeroing
> it.
>
> If it would be freeing multiple such folios, we could pack them and send them to
> a DMA engine to zero them for us (concurrently? asynchronously? I don't know :)
> )

I've been thinking about using non-temporal instructions (movnt/clzero)
for zeroing in that path.

Both the DMA engine and non-temporal zeroing would also improve things
because we won't be bringing free buffers into the cache while zeroing.
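
As a rough user-space illustration of the idea (not the kernel path
itself; a real implementation would use movnti/clzero or rep-stos
variants directly), zeroing a buffer with AVX streaming stores could
look like the sketch below. Build with -mavx; the buffer is assumed to
be 32-byte aligned and a multiple of 32 bytes long.

  #include <immintrin.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Zero the buffer with non-temporal (streaming) stores so the freed
   * memory is not pulled into the cache while it is being cleared. */
  static void zero_nontemporal(void *buf, size_t len)
  {
  	uint8_t *p = buf;
  	const __m256i zero = _mm256_setzero_si256();

  	for (size_t off = 0; off < len; off += 32)
  		_mm256_stream_si256((__m256i *)(p + off), zero);
  	_mm_sfence();	/* order the streaming stores before later accesses */
  }

  int main(void)
  {
  	size_t len = 2UL << 20;	/* stand-in for one 2 MiB huge page */
  	void *buf = aligned_alloc(32, len);

  	if (!buf)
  		return 1;
  	zero_nontemporal(buf, len);
  	printf("zeroed %zu bytes without caching them\n", len);
  	free(buf);
  	return 0;
  }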

-- 
ankur
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 2 weeks, 4 days ago
In light of the preceding discussion, we appear to have reached the
following understanding:

(1) At present we prefer to mitigate slow application startup (e.g.,
VM creation) by zeroing huge pages at the moment they are freed
(init_on_free). The principal benefit is that user space gains the
performance improvement without deploying any additional user space
daemon.

(2) Deferring the zeroing from allocation to release may occasionally
cause the thread that frees the page to differ from the one that
originally allocated it, so the clearing cost is not charged to the
allocating thread. Because this situation is rare and the existing
init_on_free mechanism in the kernel already exhibits the same
behavior, we deem the consequence acceptable.

(3) The function __unmap_hugepage_range() employs the MMU-gather
mechanism, which refrains from dropping the page reference while
holding the PTL (spinlock). This allows huge-page zeroing to be
performed in a non-atomic context.

(4) Given that, in the vast majority of cases, the same thread that
allocates a huge page also frees it, and that the exceptions highlighted
by David are genuinely rare[1], we can achieve faster application
startup by implementing an init_on_free-style mechanism.

(5) Going forward we can further optimize the zeroing process by
leveraging a DMA engine.

If the foregoing is accurate, I propose we add a new hugetlbfs mount
option to achieve the init-on-free behavior.
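
To make that concrete, usage could look like the sketch below. This is
purely illustrative: the option name and its semantics do not exist
yet. "pagesize=" is an existing hugetlbfs mount option; "zero_on_free"
is only a placeholder for the proposed behavior.

  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
  	/* Hypothetical: mount a 1 GiB hugetlbfs instance whose pages are
  	 * zeroed when they are freed back to the pool. */
  	if (mount("none", "/mnt/huge-1g", "hugetlbfs", 0,
  		  "pagesize=1G,zero_on_free"))
  		perror("mount");
  	return 0;
  }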

Thanks,
Zhe

[1]: https://lore.kernel.org/all/83798495-915b-4a5d-9638-f5b3de913b71@kernel.org/#t
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Laight 2 weeks, 3 days ago
On Tue, 20 Jan 2026 14:27:06 +0800
"Li Zhe" <lizhe.67@bytedance.com> wrote:

> In light of the preceding discussion, we appear to have reached the
> following understanding:
> 
> (1) At present we prefer to mitigate slow application startup (e.g.,
> VM creation) by zeroing huge pages at the moment they are freed
> (init_on_free). The principal benefit is that user space gains the
> performance improvement without deploying any additional user space
> daemon.

Am I missing something?
If userspace does:
$ program_a; program_b
and pages used by program_a are zeroed when it exits, you get the delay
for zeroing all the pages it used before program_b starts.
OTOH if the zeroing is deferred program_b only needs to zero the pages
it needs to start (and there may be some lurking).

The only real gain has to come from zeroing pages when the system is idle.
That will give plenty of zeroed pages needed for starting a web browser
from the desktop and also speed up single-threaded things like 'make -j1'.

	David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 2 weeks, 2 days ago
On 1/20/26 10:47, David Laight wrote:
> On Tue, 20 Jan 2026 14:27:06 +0800
> "Li Zhe" <lizhe.67@bytedance.com> wrote:
> 
>> In light of the preceding discussion, we appear to have reached the
>> following understanding:
>>
>> (1) At present we prefer to mitigate slow application startup (e.g.,
>> VM creation) by zeroing huge pages at the moment they are freed
>> (init_on_free). The principal benefit is that user space gains the
>> performance improvement without deploying any additional user space
>> daemon.
> 
> Am I missing something?
> If userspace does:
> $ program_a; program_b
> and pages used by program_a are zeroed when it exits you get the delay
> for zeroing all the pages it used before program_b starts.
> OTOH if the zeroing is deferred program_b only needs to zero the pages
> it needs to start (and there may be some lurking).

Can you point me to where that was spelled out as a requirement?

> 
> The only real gain has to come from zeroing pages when the system is idle.
> That will give plenty of zeroed pages needed for starting a web browser
> from the desktop and also speed up single-threaded things like 'make -j1'.

I am strictly against over-engineering any features that only hugetlb 
benefits from.

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 2 weeks, 3 days ago
On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:

> On Tue, 20 Jan 2026 14:27:06 +0800
> "Li Zhe" <lizhe.67@bytedance.com> wrote:
> 
> > In light of the preceding discussion, we appear to have reached the
> > following understanding:
> > 
> > (1) At present we prefer to mitigate slow application startup (e.g.,
> > VM creation) by zeroing huge pages at the moment they are freed
> > (init_on_free). The principal benefit is that user space gains the
> > performance improvement without deploying any additional user space
> > daemon.
> 
> Am I missing something?
> If userspace does:
> $ program_a; program_b
> and pages used by program_a are zeroed when it exits you get the delay
> for zeroing all the pages it used before program_b starts.
> OTOH if the zeroing is deferred program_b only needs to zero the pages
> it needs to start (and there may be some lurking).

Under the init_on_free approach, improving the speed of zeroing may
indeed prove necessary.

However, I believe we should first reach consensus on adopting
“init_on_free” as the solution to slow application startup before
turning to performance tuning.

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Gregory Price 2 weeks, 3 days ago
On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
> On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
> 
> > On Tue, 20 Jan 2026 14:27:06 +0800
> > "Li Zhe" <lizhe.67@bytedance.com> wrote:
> > 
> > > In light of the preceding discussion, we appear to have reached the
> > > following understanding:
> > > 
> > > (1) At present we prefer to mitigate slow application startup (e.g.,
> > > VM creation) by zeroing huge pages at the moment they are freed
> > > (init_on_free). The principal benefit is that user space gains the
> > > performance improvement without deploying any additional user space
> > > daemon.
> > 
> > Am I missing something?
> > If userspace does:
> > $ program_a; program_b
> > and pages used by program_a are zeroed when it exits you get the delay
> > for zeroing all the pages it used before program_b starts.
> > OTOH if the zeroing is deferred program_b only needs to zero the pages
> > it needs to start (and there may be some lurking).
> 
> Under the init_on-free approach, improving the speed of zeroing may
> indeed prove necessary.
> 
> However, I believe we should first reach consensus on adopting
> “init_on_free” as the solution to slow application startup before
> turning to performance tuning.
> 

His point was that init_on_free may not actually reduce any delays for
serial applications, and can actually introduce additional delays.

Example
-------
program_a:  alloc_hugepages(10);
            exit();

program_b:  alloc_hugepages(5);
	    exit();

/* Run programs in serial */
sh:  program_a && program_b

in zero_on_alloc():
	program_a eats zero(10) cost on startup
	program_b eats zero(5) cost on startup
	Overall zero(15) cost to start program_b

in zero_on_free()
	program_a eats zero(10) cost on startup
	program_a eats zero(10) cost on exit
	program_b eats zero(0) cost on startup
	Overall zero(20) cost to start program_b

zero_on_free is worse by zero(5)
-------

This is a trivial example, but it's unclear zero_on_free actually
provides a benefit.  You have to know ahead of time what the runtime
behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
be to determine whether there's an actual reduction in startup time.

But just trivially, starting from the base case of no pages being
zeroed, you're just injecting an additional zero(X) cost if program_a()
consumes more hugepages than program_b().

Long way of saying the shift from alloc to free seems heuristic-y and
you need stronger analysis / better data to show this change is actually
beneficial in the general case.

~Gregory
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Hildenbrand (Red Hat) 2 weeks, 2 days ago
On 1/20/26 19:18, Gregory Price wrote:
> On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
>> On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
>>
>>> On Tue, 20 Jan 2026 14:27:06 +0800
>>> "Li Zhe" <lizhe.67@bytedance.com> wrote:
>>>
>>>
>>> Am I missing something?
>>> If userspace does:
>>> $ program_a; program_b
>>> and pages used by program_a are zeroed when it exits you get the delay
>>> for zeroing all the pages it used before program_b starts.
>>> OTOH if the zeroing is deferred program_b only needs to zero the pages
>>> it needs to start (and there may be some lurking).
>>
>> Under the init_on-free approach, improving the speed of zeroing may
>> indeed prove necessary.
>>
>> However, I believe we should first reach consensus on adopting
>> “init_on_free” as the solution to slow application startup before
>> turning to performance tuning.
>>
> 
> His point was init_on_free may not actually reduce any delays on serial
> applications, and can actually introduce additional delays.
> 
> Example
> -------
> program_a:  alloc_hugepages(10);
>              exit();
> 
> program b:  alloc_hugepages(5);
> 	    exit();
> 
> /* Run programs in serial */
> sh:  program_a && program_b
> 
> in zero_on_alloc():
> 	program_a eats zero(10) cost on startup
> 	program_b eats zero(5) cost on startup
> 	Overall zero(15) cost to start program_b
> 
> in zero_on_free()
> 	program_a eats zero(10) cost on startup
> 	program_a eats zero(10) cost on exit
> 	program_b eats zero(0) cost on startup
> 	Overall zero(20) cost to start program_b
> 
> zero_on_free is worse by zero(5)
> -------
> 
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.

For VMs with hugetlb people usually have some spare pages lying around. 
VM startup time is more important for cloud providers than VM shutdown time.

I'm sure there are examples where it is the other way around, but having 
mixed workloads on the system is likely not the highest priority right now.

> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().


And whatever you do,

program_a()
program_b()

will have to zero the pages.

No asynchronous mechanism will really help.

> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.

I think the principle of "the allocator already contains zeroed pages" 
is quite universal and simple.

Whether you actually zero the pages when the last reference is 
gone (like we do in the buddy), or have that happen from some 
asynchronous context, is rather an internal optimization.

-- 
Cheers

David
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 2 weeks, 3 days ago
On Tue, 20 Jan 2026 13:18:19 -0500, gourry@gourry.net wrote:

> On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
> > On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
> > 
> > > On Tue, 20 Jan 2026 14:27:06 +0800
> > > "Li Zhe" <lizhe.67@bytedance.com> wrote:
> > > 
> > > > In light of the preceding discussion, we appear to have reached the
> > > > following understanding:
> > > > 
> > > > (1) At present we prefer to mitigate slow application startup (e.g.,
> > > > VM creation) by zeroing huge pages at the moment they are freed
> > > > (init_on_free). The principal benefit is that user space gains the
> > > > performance improvement without deploying any additional user space
> > > > daemon.
> > > 
> > > Am I missing something?
> > > If userspace does:
> > > $ program_a; program_b
> > > and pages used by program_a are zeroed when it exits you get the delay
> > > for zeroing all the pages it used before program_b starts.
> > > OTOH if the zeroing is deferred program_b only needs to zero the pages
> > > it needs to start (and there may be some lurking).
> > 
> > Under the init_on-free approach, improving the speed of zeroing may
> > indeed prove necessary.
> > 
> > However, I believe we should first reach consensus on adopting
> > "init_on_free" as the solution to slow application startup before
> > turning to performance tuning.
> > 
> 
> His point was init_on_free may not actually reduce any delays on serial
> applications, and can actually introduce additional delays.
> 
> Example
> -------
> program_a:  alloc_hugepages(10);
>             exit();
> 
> program b:  alloc_hugepages(5);
> 	    exit();
> 
> /* Run programs in serial */
> sh:  program_a && program_b
> 
> in zero_on_alloc():
> 	program_a eats zero(10) cost on startup
> 	program_b eats zero(5) cost on startup
> 	Overall zero(15) cost to start program_b
> 
> in zero_on_free()
> 	program_a eats zero(10) cost on startup
> 	program_a eats zero(10) cost on exit
> 	program_b eats zero(0) cost on startup
> 	Overall zero(20) cost to start program_b
> 
> zero_on_free is worse by zero(5)
> -------
> 
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.
> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().
> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.

I understand your concern. At some point some process must pay the
cost of zeroing, and the optimal strategy is inevitably
workload-dependent.

Our "zero-on-free for huge pages" draws on the existing kernel
init_on_free mechanism. Of course, it may prove sub-optimal in certain
scenarios.

Consistent with "provide tools, not policy", perhaps the decision is
better left to user space. And that is exactly what this patchset
does. Requiring a userspace daemon to decide when to zero pages
certainly adds complexity, but it also gives administrators a single,
flexible knob that can be tuned for any workload.

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by David Laight 2 weeks, 3 days ago
On Tue, 20 Jan 2026 13:18:19 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Tue, Jan 20, 2026 at 06:39:48PM +0800, Li Zhe wrote:
> > On Tue, 20 Jan 2026 09:47:44 +0000, david.laight.linux@gmail.com wrote:
> >   
> > > On Tue, 20 Jan 2026 14:27:06 +0800
> > > "Li Zhe" <lizhe.67@bytedance.com> wrote:
> > >   
> > > > In light of the preceding discussion, we appear to have reached the
> > > > following understanding:
> > > > 
> > > > (1) At present we prefer to mitigate slow application startup (e.g.,
> > > > VM creation) by zeroing huge pages at the moment they are freed
> > > > (init_on_free). The principal benefit is that user space gains the
> > > > performance improvement without deploying any additional user space
> > > > daemon.  
> > > 
> > > Am I missing something?
> > > If userspace does:
> > > $ program_a; program_b
> > > and pages used by program_a are zeroed when it exits you get the delay
> > > for zeroing all the pages it used before program_b starts.
> > > OTOH if the zeroing is deferred program_b only needs to zero the pages
> > > it needs to start (and there may be some lurking).  
> > 
> > Under the init_on-free approach, improving the speed of zeroing may
> > indeed prove necessary.
> > 
> > However, I believe we should first reach consensus on adopting
> > “init_on_free” as the solution to slow application startup before
> > turning to performance tuning.
> >   
> 
> His point was init_on_free may not actually reduce any delays on serial
> applications, and can actually introduce additional delays.
> 
> Example
> -------
> program_a:  alloc_hugepages(10);
>             exit();
> 
> program b:  alloc_hugepages(5);
> 	    exit();
> 
> /* Run programs in serial */
> sh:  program_a && program_b
> 
> in zero_on_alloc():
> 	program_a eats zero(10) cost on startup
> 	program_b eats zero(5) cost on startup
> 	Overall zero(15) cost to start program_b
> 
> in zero_on_free()
> 	program_a eats zero(10) cost on startup

Do you get that cost? - won't all the unused memory be zeros?

> 	program_a eats zero(10) cost on exit
> 	program_b eats zero(0) cost on startup
> 	Overall zero(20) cost to start program_b
> 
> zero_on_free is worse by zero(5)
> -------
> 
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.
> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().

I'd consider a different test:
	for c in $(jot 1000); do program_a; done

Regardless of whether you zero on alloc or on free, all the zeroing is inline.
Move it to a low-priority thread (that uses a non-aggressive loop) and
there will be a reasonable chance of pre-zeroed pages being available.
(Most DMA is far too aggressive...)

If you zero on free it might also be a waste of time.
Maybe the memory is next used to read data from a disk file.

	David

> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.
> 
> ~Gregory
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Gregory Price 2 weeks, 3 days ago
On Tue, Jan 20, 2026 at 07:30:27PM +0000, David Laight wrote:
> > /* Run programs in serial */
> > sh:  program_a && program_b
> > 
> > in zero_on_alloc():
> > 	program_a eats zero(10) cost on startup
> > 	program_b eats zero(5) cost on startup
> > 	Overall zero(15) cost to start program_b
> > 
> > in zero_on_free()
> > 	program_a eats zero(10) cost on startup
> 
> Do you get that cost? - wont all the unused memory be zeros.
> 

If program_a was the first to access, wouldn't it have had to zero it?

> > But just trivially, starting from the base case of no pages being
> > zeroed, you're just injecting an additional zero(X) cost if program_a()
> > consumes more hugepages than program_b().
> 
> I'd consider a different test:
> 	for c in $(jot 1 1000); do program_a; done
> 
> Regardless of whether you zero on alloc or free all the zeroing is in line.
> Move it to a low priority thread (that uses a non-aggressive loop) and
> there will be reasonable chance of there being pre-zeroed pages available.
> (Most DMA is far too aggressive...)
> 
> If you zero on free it might also be a waste of time.
> Maybe the memory is next used to read data from a disk file.
> 

Right, both points here being that it's heuristic-y: it only applies in
certain scenarios, and trying to optimize for one probably hurts another.

~Gregory
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Gregory Price 2 weeks, 3 days ago
On Tue, Jan 20, 2026 at 01:18:19PM -0500, Gregory Price wrote:
> This is a trivial example, but it's unclear zero_on_free actually
> provides a benefit.  You have to know ahead of time what the runtime
> behavior, pre-zeroed count, and allocation pattern (0->10->5->...) would
> be to determine whether there's an actual reduction in startup time.
> 
> But just trivially, starting from the base case of no pages being
> zeroed, you're just injecting an additional zero(X) cost if program_a()
> consumes more hugepages than program_b().
> 
> Long way of saying the shift from alloc to free seems heuristic-y and
> you need stronger analysis / better data to show this change is actually
> beneficial in the general case.
> 

As an addendum to this: maybe this is an indication that a global
switch (per-node sysfs entry) is not the best decision, and that
there's a better way to accomplish this with a reduced scope.

hugetlb-only sysfs knob
	- same issue as the current proposal, but better placed:
	  why would you only apply this on one node?

prctl thingy
	- limits effects to just those opting into zero-on-free
	- probably still needs hugetlb-internal zeroed-pages tracking
	  but doesn't require the rest of the machinery

do it entirely in userland
	- modify the software to zero before exit
	- use MAP_UNINITIALIZED
	- useful and simple if your hugetlb use case is homogeneous

there are probably more options

~Gregory
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Andrew Morton 1 month ago
On Wed,  7 Jan 2026 19:31:22 +0800 "Li Zhe" <lizhe.67@bytedance.com> wrote:

> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
> 
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 250
> milliseconds for a 1G page on a Skylake machine).
> 
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
> 
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application starts (such as a VM backed
> by these pages), rendering the launch noticeably slow.
> 
> On an Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> pages takes about 16 seconds, roughly 250 ms per page. Even with
> Ankur’s optimizations[2], the time drops only to ~13 seconds,
> ~200 ms per page, still a noticeable delay.
> 
> To accelerate the above scenario, this patchset exports a per-node,
> read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
> 
> This mechanism offers the following advantages:
> 
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
> 
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
> 
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
> 
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
> 
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
> 
> Tested on the same Skylake platform as above, when the 64 GiB of memory
> was pre-zeroed in advance by the pre-zeroing mechanism, the faulting
> latency test completed in negligible time.
> 
> In user space, we can use system calls such as epoll and write to zero
> huge folios as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads (each running thread_fun()) that wait for huge pages on
> node 0 to become eligible for zeroing; whenever such pages are available,
> the threads clear them in parallel.

This seems to be quite a lot of messing around in userspace.  Perhaps
unavoidable given the tradeoffs which are involved, and reasonable in
the sort of environments in which this will be used.  I guess there are
many alternatives - let's see what others think.

>  fs/hugetlbfs/inode.c    |   3 +-
>  include/linux/hugetlb.h |  26 +++++
>  mm/hugetlb.c            | 131 ++++++++++++++++++++++---
>  mm/hugetlb_internal.h   |   6 ++
>  mm/hugetlb_sysfs.c      | 206 ++++++++++++++++++++++++++++++++++++----
>  5 files changed, 337 insertions(+), 35 deletions(-)

Let's find places in Documentation/ (and Documentation/ABI) to document
the userspace interface?
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 4 days ago
On Wed, 7 Jan 2026 08:19:21 -0800, akpm@linux-foundation.org wrote:

> > In user space, we can use system calls such as epoll and write to zero
> > huge folios as they become available, and sleep when none are ready. The
> > following pseudocode illustrates this approach. The pseudocode spawns
> > eight threads (each running thread_fun()) that wait for huge pages on
> > node 0 to become eligible for zeroing; whenever such pages are available,
> > the threads clear them in parallel.
> 
> This seems to be quite a lot of messing around in userspace.  Perhaps
> unavoidable given the tradeoffs which are involved, and reasonable in
> the sort of environments in which this will be used.

Apologies for the delayed response. I share the same view.

> I guess there are many alternatives - let's see what others think.

Let's cc the developers who took part in the earlier discussion and
see whether any better ideas emerge.

> >  fs/hugetlbfs/inode.c    |   3 +-
> >  include/linux/hugetlb.h |  26 +++++
> >  mm/hugetlb.c            | 131 ++++++++++++++++++++++---
> >  mm/hugetlb_internal.h   |   6 ++
> >  mm/hugetlb_sysfs.c      | 206 ++++++++++++++++++++++++++++++++++++----
> >  5 files changed, 337 insertions(+), 35 deletions(-)
> 
> Let's find places in Documentation/ (and Documentation/ABI) to document
> the userspace interface?

I believe this userspace interface should be documented in
'Documentation/admin-guide/mm/hugetlbpage.rst', and I will include
this update in V3.

However, I noticed that no document under 'Documentation/ABI' explicitly
describes the interfaces under the directory
'/sys/devices/system/node/nodeX/hugepages/hugepages-<size>/'. Instead, it
points directly to 'Documentation/admin-guide/mm/hugetlbpage.rst' (see
'Documentation/ABI/stable/sysfs-devices-node').

Given this, is it sufficient to only modify
'Documentation/admin-guide/mm/hugetlbpage.rst'?
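
For reference, if an ABI stanza is wanted as well, it could look
roughly like the following (the Date and Contact values here are only
placeholders):

  What:		/sys/devices/system/node/nodeX/hugepages/hugepages-<size>/zeroable_hugepages
  Date:		January 2026
  Contact:	linux-mm@kvack.org
  Description:
  		Reading returns the number of free huge pages of this size
  		on node X that have not been pre-zeroed yet. Writing an
  		integer in [0, max], or the string "max", requests that up
  		to that many of them be zeroed.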

Thanks,
Zhe
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Ankur Arora 3 weeks, 4 days ago
Li Zhe <lizhe.67@bytedance.com> writes:

> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 250
> milliseconds for a 1G page on a Skylake machine).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application starts (such as a VM backed
> by these pages), rendering the launch noticeably slow.
>
> On an Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> pages takes about 16 seconds, roughly 250 ms per page. Even with
> Ankur’s optimizations[2], the time drops only to ~13 seconds,
> ~200 ms per page, still a noticeable delay.
>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.

> Tested on the same Skylake platform as above, when the 64 GiB of memory
> was pre-zeroed in advance by the pre-zeroing mechanism, the faulting
> latency test completed in negligible time.
>
> In user space, we can use system calls such as epoll and write to zero
> huge folios as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads (each running thread_fun()) that wait for huge pages on
> node 0 to become eligible for zeroing; whenever such pages are available,
> the threads clear them in parallel.
>
>   static void thread_fun(void)
>   {
>   	epoll_create();
>   	epoll_ctl();
>   	while (1) {
>   		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		if (val > 0)
>   			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		epoll_wait();
>   	}
>   }

Given that zeroable_hugepages is per node, anybody who writes to
it would need to know how large the aggregate demand would be.

Seems to me that the only value that might make sense would be "max".
And at that point this approach seems a little bit like init_on_free.

Ankur

>   static void start_pre_zero_thread(int thread_num)
>   {
>   	create_pre_zero_threads(thread_num, thread_fun)
>   }
>
>   int main(void)
>   {
>   	start_pre_zero_thread(8);
>   }
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
> [2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u
>
> Li Zhe (8):
>   mm/hugetlb: add pre-zeroed framework
>   mm/hugetlb: convert to prep_account_new_hugetlb_folio()
>   mm/hugetlb: move the huge folio to the end of the list during enqueue
>   mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
>   mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
>   mm/hugetlb: relocate the per-hstate struct kobject pointer
>   mm/hugetlb: add epoll support for interface "zeroable_hugepages"
>   mm/hugetlb: limit event generation frequency of function
>     do_zero_free_notify()
>
>  fs/hugetlbfs/inode.c    |   3 +-
>  include/linux/hugetlb.h |  26 +++++
>  mm/hugetlb.c            | 131 ++++++++++++++++++++++---
>  mm/hugetlb_internal.h   |   6 ++
>  mm/hugetlb_sysfs.c      | 206 ++++++++++++++++++++++++++++++++++++----
>  5 files changed, 337 insertions(+), 35 deletions(-)
>
> ---
> Changelogs:
>
> v1->v2 :
> - Use guard() to simplify function hpage_wait_zeroing(). (pointed by
>   Raghu)
> - Simplify the logic of zero_free_hugepages_nid() by removing
>   redundant checks and exiting the loop upon encountering a
>   pre-zeroed folio. (pointed by Frank)
> - Include in the cover letter a performance comparison with Ankur's
>   optimization patch[2]. (pointed by Andrew)
>
> v1: https://lore.kernel.org/all/20251225082059.1632-1-lizhe.67@bytedance.com/


--
ankur
Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 3 weeks, 4 days ago
On Mon, 12 Jan 2026 14:01:29 -0800, ankur.a.arora@oracle.com wrote:

> > In user space, we can use system calls such as epoll and write to zero
> > huge folios as they become available, and sleep when none are ready. The
> > following pseudocode illustrates this approach. The pseudocode spawns
> > eight threads (each running thread_fun()) that wait for huge pages on
> > node 0 to become eligible for zeroing; whenever such pages are available,
> > the threads clear them in parallel.
> >
> >   static void thread_fun(void)
> >   {
> >   	epoll_create();
> >   	epoll_ctl();
> >   	while (1) {
> >   		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
> >   		if (val > 0)
> >   			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
> >   		epoll_wait();
> >   	}
> >   }
> 
> Given that zeroable_hugepages is per node, anybody who writes to
> it would need to know how much the aggregate demand would be.
> 
> Seems to me that the only value that might make sense would be "max".
> And at that point this approach seems a little bit like init_on_free.

Yes, writing “max” suffices for the vast majority of workloads.

However, once multiple mutually independent application processes each
need huge pages, the ability to specify an exact value becomes
essential, because the CPU time each process spends on zeroing can
then be charged to its own cgroup. If we currently consider “max”
sufficient, we can implement support for that parameter alone and
extend it later when necessary.

Although “max” resembles init_on_free at first glance, it leaves the
decision of “when and on which CPU to zero” entirely to user space,
thereby eliminating the concern previously raised.
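
As a minimal sketch of that usage (assuming the sysfs semantics
described in the cover letter), a consumer that is about to fault in
eight 1 GiB pages on node 0 could pre-zero exactly that many from its
own context, so the CPU time is charged to its own cgroup:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  #define ZEROABLE \
  	"/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"

  /* Ask the kernel to pre-zero nr_pages huge pages; the work happens in
   * the calling context and is therefore accounted to our own cgroup. */
  static int pre_zero(unsigned long nr_pages)
  {
  	char buf[32];
  	int fd, len, ret = 0;

  	fd = open(ZEROABLE, O_WRONLY);
  	if (fd < 0)
  		return -1;
  	len = snprintf(buf, sizeof(buf), "%lu", nr_pages);
  	if (write(fd, buf, len) != len)
  		ret = -1;
  	close(fd);
  	return ret;
  }

  int main(void)
  {
  	return pre_zero(8) ? 1 : 0;
  }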

Thanks,
Zhe