[PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by 李喆 1 month, 2 weeks ago
From: Li Zhe <lizhe.67@bytedance.com>

This patchset is based on the earlier patch[1] ("mm/hugetlb: optionally
pre-zero hugetlb pages").

Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 40
milliseconds for a 1G page on a recent AMD-based system).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial
delay when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by these
pages) starts. For 256 1G pages and 40ms per page, this would take
10 seconds, a noticeable delay.

To accelerate the above scenario, this patchset exports a per-node,
read-write "zeroable_hugepages" interface for every hugepage size.
Reading this interface reports how many hugepages on that node can
currently be pre-zeroed; writing an integer in the range [0, that
count], or the literal string "max", requests that many pages be zeroed
in a single operation.

This mechanism offers the following advantages:

(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.

(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.

(3) By binding the zeroing processes to specific CPUs, users can confine
zeroing threads to cores that do not run latency-critical tasks,
eliminating interference (a minimal affinity sketch follows this list).

(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.

(5) The CPU consumption incurred by zeroing can be throttled and contained
with cgroups, ensuring that the cost is not borne system-wide.
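
To make point (3) concrete, the short sketch below pins the calling
process to two housekeeping CPUs before any zeroing requests are issued.
The CPU numbers are arbitrary placeholders and the zeroing loop itself
is elided; the same confinement can also be applied externally with
taskset, a cpuset, or a CPU-limited cgroup (point (5)).

  #define _GNU_SOURCE
  #include <sched.h>

  /* confine all subsequent zeroing work to CPUs 2 and 3 (placeholders) */
  static void pin_to_housekeeping_cpus(void)
  {
  	cpu_set_t set;

  	CPU_ZERO(&set);
  	CPU_SET(2, &set);
  	CPU_SET(3, &set);
  	sched_setaffinity(0, sizeof(set), &set); /* pid 0 == calling process */
  }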

On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
least 25628 us (figure taken from the test results in [1]).

In user space, we can combine epoll with reads and writes on this file
to zero huge pages as they become available and sleep when none are
ready. The following pseudocode illustrates the approach: it spawns
eight threads that wait for huge pages on node 0 to become eligible for
zeroing; whenever such pages are available, the threads clear them in
parallel.

  static void thread_fun(void)
  {
  	epoll_create();
  	epoll_ctl();
  	while (1) {
  		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		if (val > 0)
  			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
  		epoll_wait();
  	}
  }
  
  static void start_pre_zero_thread(int thread_num)
  {
  	create_pre_zero_threads(thread_num, thread_fun);
  }
  
  int main(void)
  {
  	start_pre_zero_thread(8);
  }
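
For reference, a more concrete, compilable sketch of the same loop is
shown below. It is only illustrative: it assumes the attribute is wired
up through sysfs_notify(), so epoll reports readiness as
EPOLLPRI|EPOLLERR and the file must be re-read from offset 0 after every
wakeup; the path and the "max" token follow the description above, the
thread count is arbitrary, and all error handling is omitted.

  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/epoll.h>
  #include <unistd.h>

  #define ZEROABLE \
  	"/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"
  #define NR_THREADS 8

  static void *zero_thread(void *arg)
  {
  	char buf[32];
  	struct epoll_event ev = { .events = EPOLLPRI | EPOLLERR };
  	int fd = open(ZEROABLE, O_RDWR);
  	int efd = epoll_create1(0);

  	(void)arg;
  	ev.data.fd = fd;
  	read(fd, buf, sizeof(buf));	/* consume the current value */
  	epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev);

  	for (;;) {
  		/* re-read from offset 0 to get the value and re-arm the notification */
  		lseek(fd, 0, SEEK_SET);
  		memset(buf, 0, sizeof(buf));
  		read(fd, buf, sizeof(buf) - 1);
  		if (atol(buf) > 0) {
  			/* zero everything currently eligible on node 0 */
  			lseek(fd, 0, SEEK_SET);
  			write(fd, "max", 3);
  		}
  		/* sleep until the kernel signals newly zeroable pages */
  		epoll_wait(efd, &ev, 1, -1);
  	}
  	return NULL;
  }

  int main(void)
  {
  	pthread_t tids[NR_THREADS];
  	int i;

  	for (i = 0; i < NR_THREADS; i++)
  		pthread_create(&tids[i], NULL, zero_thread, NULL);
  	for (i = 0; i < NR_THREADS; i++)
  		pthread_join(tids[i], NULL);
  	return 0;
  }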

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t

Li Zhe (8):
  mm/hugetlb: add pre-zeroed framework
  mm/hugetlb: convert to prep_account_new_hugetlb_folio()
  mm/hugetlb: move the huge folio to the end of the list during enqueue
  mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
  mm/hugetlb: relocate the per-hstate struct kobject pointer
  mm/hugetlb: add epoll support for interface "zeroable_hugepages"
  mm/hugetlb: limit event generation frequency of function
    do_zero_free_notify()

 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  26 ++++++
 mm/hugetlb.c            | 133 +++++++++++++++++++++++---
 mm/hugetlb_internal.h   |   6 ++
 mm/hugetlb_sysfs.c      | 202 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 335 insertions(+), 35 deletions(-)

-- 
2.20.1
Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Andrew Morton 1 month, 1 week ago
On Thu, 25 Dec 2025 16:20:51 +0800 李喆 <lizhe.67@bytedance.com> wrote:

> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
> 
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
> 
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
> 
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.

Ankur's contiguous page clearing work
(https://lkml.kernel.org/r/20251215204922.475324-1-ankur.a.arora@oracle.com)
will hopefully result in significant changes to the timing observations
in your changelogs?
Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 1 month, 1 week ago
On Sun, 28 Dec 2025 13:44:54 -0800, akpm@linux-foundation.org wrote:

> > Fresh hugetlb pages are zeroed out when they are faulted in,
> > just like with all other page types. This can take up a good
> > amount of time for larger page sizes (e.g. around 40
> > milliseconds for a 1G page on a recent AMD-based system).
> > 
> > This normally isn't a problem, since hugetlb pages are typically
> > mapped by the application for a long time, and the initial
> > delay when touching them isn't much of an issue.
> > 
> > However, there are some use cases where a large number of hugetlb
> > pages are touched when an application (such as a VM backed by these
> > pages) starts. For 256 1G pages and 40ms per page, this would take
> > 10 seconds, a noticeable delay.
> 
> Ankur's contiguous page clearing work
> (https://lkml.kernel.org/r/20251215204922.475324-1-ankur.a.arora@oracle.com)
> will hopefully result in significant changes to the timing observations
> in your changelogs?

I ran the experiment on my Skylake machine; below are the fault
latencies measured when touching 64 GiB of memory backed by 1 GiB huge
pages.

Without Ankur's optimization:

	Total time: 15.989429 seconds
	Avg time per 1GB page: 0.249835 seconds

With Ankur's optimization:

	Total time: 12.931696 seconds
	Avg time per 1GB page: 0.202058 seconds

For comparison, when the same 64 GiB of memory was pre-zeroed in
advance by the pre-zeroing mechanism, the test completed in negligible
time.

I will incorporate these findings into the V2 description.

Thanks,
Zhe
Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Mateusz Guzik 1 month, 1 week ago
On Thu, Dec 25, 2025 at 04:20:51PM +0800, 李喆 wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
> 
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
> 
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
> 
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.
> 
> To accelerate the above scenario, this patchset exports a per-node,
> read-write zeroable_hugepages interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
> 
> This mechanism offers the following advantages:
> 
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
> 
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
> 
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
> 
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
> 
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
> 
> On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
> least 25628 us (figure inherited from the test results cited herein[1]).
> 
> In user space, we can use system calls such as epoll and write to zero
> huge pages as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads that wait for huge pages on node 0 to become eligible for
> zeroing; whenever such pages are available, the threads clear them in
> parallel.
> 
>   static void thread_fun(void)
>   {
>   	epoll_create();
>   	epoll_ctl();
>   	while (1) {
>   		val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		if (val > 0)
>   			system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>   		epoll_wait();
>   	}
>   }
>   
>   static void start_pre_zero_thread(int thread_num)
>   {
>   	create_pre_zero_threads(thread_num, thread_fun)
>   }
>   
>   int main(void)
>   {
>   	start_pre_zero_thread(8);
>   }
> 

In the name of "provide tools, not policy" making userspace call the
shots is the right approach, which I advocated for in the original
thread.

I do have concerns about the specific interface as I think it is a
little too limited.

Suppose vastly different deployments with different needs. For example
one may want to keep at least n pages ready to use, RAM permitting.

At the same time it perhaps would like to balance CPU usage vs other
tasks, so for example it would control parallelism based on observed
churn rate.

So a toolset I would consider viable would need to provide an extensible
interface to future-proof it.

As for an immediate need not met with the current patchset, there is no
configurable threshold for free zeroed page count to generate a wake up.

I suspect a bunch of ioctls would be needed here.

I don't know if sysfs is viable at all for this. Worst case a device (or
a set of per-node devices) can be created with the same goal.

For illustrative purposes perhaps something like this:

I'm assuming a centralized file/device; the node parameter can be
dropped otherwise.

struct hugepage_zero_req {
	int version; /* version of the struct for extensibility purposes; alternatively different versions can use different ioctls */
	int node; /* numa node to zero in */
	int pages; /* max pages to zero out in this call */
};

then interested threads can do:
	struct hugepage_zero_req hzr = { .node = 0, .pages = INT_MAX };
	pages = ioctl(hfd, HUGEPAGE_ZERO_PERFORM, &hzr); /* returns the number of pages zeroed */

struct hugepage_zero_configure {
	int version;
	int node; /* numa node to watch, open the device more times for other nodes */
	int minfree; /* issue a wake up if free pages drop below this value */
};

and of course:

struct hugepage_zero_query {
	int version;
	int node;
	size_t total_pages; /* all pages installed in the domain */
	size_t free_pages; /* total free pages */
	size_t free_huge_pages; /* total free huge pages */
	size_t zeroed_huge_pages; /* huge pages ready to use */
	size_t pad[MEDIUM_NUM]; /* optionally make extensible without abi breakage, version handling and ioctl renumbering? */
	.... /* whatever else useful, note it can be added later */
};

Then I would imagine a userspace daemon with arbitrary policies can be
written just fine.

Consider this pseudo-code which spawns 8 threads on domain 1 and
dispatches some number of them to zero stuff based on what it sees.

#define THREADS 8
int main(void)
{
	bind_to_domain(1);
	start_pre_zero_threads(THREADS); /* create a thread pool */

	hfd = open("/dev/hugepagectl", O_RDWR);
	struct hugepage_zero_configure hzc = { .version = MAGIC_VERSION, .node = 1, .minfree = SMALL_NUM };
	ioctl(hfd, HUGEPAGE_ZERO_CONFIGURE, &hzc);

	epoll_create();
	epoll_ctl();

	for (;;) {
		epoll_wait();

		struct hugepage_zero_query hzq = { .version = MAGIC_VERSION, .node = 1 };
		ioctl(hfd, HUGEPAGE_ZERO_QUERY, &hzq);
		if (hzq.free_huge_pages == 0)
			/* nothing which can be done */
			continue;
		tozero = tune_the_request(&hzq);
		/*
		 * get up to THREADS workers zeroing in parallel based on magic policy
		 */
		dispatch(tozero);
	}
}

If one wants one daemon handling multiple domains, it can open the file
once per domain to cover them.
Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 1 month, 1 week ago
On Sat, 27 Dec 2025 08:21:16 +0100, mjguzik@gmail.com wrote:

> In the name of "provide tools, not policy" making userspace call the
> shots is the right approach, which I advocated for in the original
> thread.

Thank you for your endorsement!

> I do have concerns about the specific interface as I think it is a
> little too limited.
> 
> Suppose vastly different deployments with different needs. For example
> one may want to keep at least n pages ready to use, RAM permitting.
> 
> At the same time it perhaps would like to balance CPU usage vs other
> tasks, so for example it would control parallelism based on observed
> churn rate.
> 
> So a toolset I would consider viable would need to provide an extensible
> interface to future-proof it.
> 
> As for an immediate need not met with the current patchset, there is no
> configurable threshold for free zeroed page count to generate a wake up.
> 
> I suspect a bunch of ioctls would be needed here.
> 
> I don't know if sysfs is viable at all for this. Worst case a device (or
> a set of per-node devices) can be created with the same goal.

In my view, the present kernel framework does not allow an ioctl
interface to be placed under the per-node huge-page directories.

The functionality you describe appears to align closely with that
offered by the cgroup.event_control interface in the memory
controller.

We could therefore introduce a new event_control file for huge-page
events, following the same pattern. Given that all huge-page
attributes already live in sysfs, such an addition would keep the
interface consistent and avoid the extra indirection of a new
/dev/hugepagectl file.
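
Purely as a sketch of how user space might consume such a file (every
name, the argument format and the threshold semantics below are
hypothetical, modeled on the memcg v1 cgroup.event_control usage; no
such hugepage file exists today):

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/eventfd.h>
  #include <unistd.h>

  #define HP_DIR "/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/"

  int main(void)
  {
  	char cmd[64];
  	uint64_t cnt;
  	int efd = eventfd(0, 0);
  	int zfd = open(HP_DIR "zeroable_hugepages", O_RDONLY);
  	int cfd = open(HP_DIR "event_control", O_WRONLY); /* hypothetical file */

  	/* ask to be woken once at least 8 pages become zeroable (assumed format) */
  	snprintf(cmd, sizeof(cmd), "%d %d 8", efd, zfd);
  	write(cfd, cmd, strlen(cmd));

  	read(efd, &cnt, sizeof(cnt)); /* blocks until the event fires */
  	return 0;
  }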

Thanks,
Zhe
Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Frank van der Linden 1 month, 1 week ago
On Thu, Dec 25, 2025 at 12:21 AM 李喆 <lizhe.67@bytedance.com> wrote:
>
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.
>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write zeroable_hugepages interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
>
> On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
> least 25628 us (figure inherited from the test results cited herein[1]).
>
> In user space, we can use system calls such as epoll and write to zero
> huge pages as they become available, and sleep when none are ready. The
> following pseudocode illustrates this approach. The pseudocode spawns
> eight threads that wait for huge pages on node 0 to become eligible for
> zeroing; whenever such pages are available, the threads clear them in
> parallel.
>
>   static void thread_fun(void)
>   {
>         epoll_create();
>         epoll_ctl();
>         while (1) {
>                 val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 if (val > 0)
>                         system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 epoll_wait();
>         }
>   }
>
>   static void start_pre_zero_thread(int thread_num)
>   {
>         create_pre_zero_threads(thread_num, thread_fun)
>   }
>
>   int main(void)
>   {
>         start_pre_zero_thread(8);
>   }
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t

Thanks for taking my patches and extending them!

As far as I can see, you took what I did and then added a framework
for the zeroing to be done in user context, and possibly by multiple
threads, right? There were one or two comments on my original patch
set that objected to the zero cost being taken by a system thread, not
a user thread, so this should address that.

I'll go through them to provide comments inline.

- Frank
Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Frank van der Linden 1 month, 1 week ago
Is there any situation where you would write anything else than 'max'
to the new sysfs file? E.g. in which scenarios does it make sense to
*not* pre-zero all freed hugetlb folios? There doesn't seem to be a
point to just doing a certain number. You can't know for sure if the
number you read will remain correct, as it's just a snapshot. So how
would you determine a correct number other than 'max'?

- Frank
Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
Posted by Li Zhe 1 month, 1 week ago
On Fri, 26 Dec 2025 13:42:13 -0800, fvdl@google.com wrote:

> Is there any situation where you would write anything else than 'max'
> to the new sysfs file? E.g. in which scenarios does it make sense to
> *not* pre-zero all freed hugetlb folios? There doesn't seem to be a
> point to just doing a certain number. You can't know for sure if the
> number you read will remain correct, as it's just a snapshot. So how
> would you determine a correct number other than 'max'?

My view is that each application knows its own huge-page requirement
and should therefore write the corresponding number into the
"zeroable_hugepages" interface. Since the zeroing work is accounted
to the application process, the CPU time it consumes can be
constrained through that process's cgroup.

Thanks,
Zhe