From: Li Zhe <lizhe.67@bytedance.com>

Fresh hugetlb pages are zeroed out when they are faulted in,
just as with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 40 milliseconds
for a 1G page on a recent AMD-based system).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial delay
when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by
these pages) starts. For 256 1G pages at 40ms per page, this adds
up to roughly 10 seconds, a noticeable delay.

Add a new zeroable_hugepages interface under each
/sys/devices/system/node/node*/hugepages/hugepages-***kB directory.
Reading it returns the number of free huge folios of the corresponding
size on that node that have not yet been pre-zeroed. Writing an
unsigned integer x, or the literal string "max", requests that up to
x of those huge pages be zeroed on demand.
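
For example (an illustrative sketch only; the node, page size and the
count written below are arbitrary), a user-space program could query
the count and then request pre-zeroing like this:

    /* Illustrative sketch, not part of this patch. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define ZPATH "/sys/devices/system/node/node0/hugepages/" \
                  "hugepages-1048576kB/zeroable_hugepages"

    int main(void)
    {
        char buf[32];
        ssize_t n;
        int fd;

        /* How many free 1G folios on node 0 still need zeroing? */
        fd = open(ZPATH, O_RDONLY);
        if (fd < 0)
            return 1;
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("zeroable on node0: %s", buf);
        }
        close(fd);

        /* Request that up to 16 of them be zeroed now. */
        fd = open(ZPATH, O_WRONLY);
        if (fd >= 0) {
            if (write(fd, "16", 2) < 0)
                perror("write");
            close(fd);
        }
        return 0;
    }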

Exporting this interface offers the following advantages:

(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.

(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.

(3) By binding zeroing processes to specific CPUs, users can confine
the work to cores that do not run latency-critical tasks, avoiding
interference (a sketch combining (2) and (3) follows this list).

(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.

(5) The CPU consumption incurred by zeroing can be throttled and
contained with cgroups, ensuring that the cost is not borne
system-wide.
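
As a sketch of how (2) and (3) can be combined (the CPU numbers, node,
page size and per-worker count below are arbitrary examples, not part
of this patch), a launcher could fork several workers, pin each one to
a housekeeping CPU, and have each request a share of the zeroing:

    /* Illustrative sketch: four pinned workers, each zeroing a share. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define ZPATH "/sys/devices/system/node/node0/hugepages/" \
                  "hugepages-1048576kB/zeroable_hugepages"

    int main(void)
    {
        /* Assumed housekeeping CPUs with no latency-critical work. */
        int cpus[] = { 60, 61, 62, 63 };
        int nworkers = 4;
        int i;

        for (i = 0; i < nworkers; i++) {
            if (fork() == 0) {
                cpu_set_t set;
                char req[16];
                int fd, len;

                CPU_ZERO(&set);
                CPU_SET(cpus[i], &set);
                sched_setaffinity(0, sizeof(set), &set);

                fd = open(ZPATH, O_WRONLY);
                if (fd >= 0) {
                    /* Each worker asks for a quarter of 256 pages. */
                    len = snprintf(req, sizeof(req), "%d", 64);
                    write(fd, req, len);
                    close(fd);
                }
                _exit(0);
            }
        }

        while (wait(NULL) > 0)
            ;
        return 0;
    }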

On an AMD Milan platform, pre-zeroing shortens each 1G huge-page fault
by at least 25628 us (based on the test results in [1]).

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t

Co-developed-by: Frank van der Linden <fvdl@google.com>
Signed-off-by: Frank van der Linden <fvdl@google.com>
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
mm/hugetlb_sysfs.c | 120 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 120 insertions(+)
diff --git a/mm/hugetlb_sysfs.c b/mm/hugetlb_sysfs.c
index 79ece91406bf..8c3e433209c3 100644
--- a/mm/hugetlb_sysfs.c
+++ b/mm/hugetlb_sysfs.c
@@ -352,6 +352,125 @@ struct node_hstate {
};
static struct node_hstate node_hstates[MAX_NUMNODES];
+static ssize_t zeroable_hugepages_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct hstate *h;
+ unsigned long free_huge_pages_zero;
+ int nid;
+
+ h = kobj_to_hstate(kobj, &nid);
+ if (WARN_ON(nid == NUMA_NO_NODE))
+ return -EPERM;
+
+ free_huge_pages_zero = h->free_huge_pages_node[nid] -
+ h->free_huge_pages_zero_node[nid];
+
+ return sprintf(buf, "%lu\n", free_huge_pages_zero);
+}
+
+static inline bool zero_should_abort(struct hstate *h, int nid)
+{
+ return (h->free_huge_pages_zero_node[nid] ==
+ h->free_huge_pages_node[nid]) ||
+ list_empty(&h->hugepage_freelists[nid]);
+}
+
+static void zero_free_hugepages_nid(struct hstate *h,
+ int nid, unsigned int nr_zero)
+{
+ struct list_head *freelist = &h->hugepage_freelists[nid];
+ unsigned int nr_zerod = 0;
+ struct folio *folio;
+
+ if (zero_should_abort(h, nid))
+ return;
+
+ spin_lock_irq(&hugetlb_lock);
+
+ while (nr_zerod < nr_zero) {
+
+ if (zero_should_abort(h, nid) || fatal_signal_pending(current))
+ break;
+
+ freelist = freelist->prev;
+ if (unlikely(list_is_head(freelist, &h->hugepage_freelists[nid])))
+ break;
+ folio = list_entry(freelist, struct folio, lru);
+
+ if (folio_test_hugetlb_zeroed(folio) ||
+ folio_test_hugetlb_zeroing(folio))
+ continue;
+
+ folio_set_hugetlb_zeroing(folio);
+
+ /*
+ * Incrementing this here is a bit of a fib, since
+ * the page hasn't been cleared yet (it will be done
+ * immediately after dropping the lock below). But
+ * it keeps the count consistent with the overall
+ * free count in case the page gets taken off the
+ * freelist while we're working on it.
+ */
+ h->free_huge_pages_zero_node[nid]++;
+ spin_unlock_irq(&hugetlb_lock);
+
+ /*
+ * HWPoison pages may show up on the freelist.
+ * Don't try to zero it out, but do set the flag
+ * and counts, so that we don't consider it again.
+ */
+ if (!folio_test_hwpoison(folio))
+ folio_zero_user(folio, 0);
+
+ cond_resched();
+
+ spin_lock_irq(&hugetlb_lock);
+ folio_set_hugetlb_zeroed(folio);
+ folio_clear_hugetlb_zeroing(folio);
+
+ /*
+ * If the page is still on the free list, move
+ * it to the head.
+ */
+ if (folio_test_hugetlb_freed(folio))
+ list_move(&folio->lru, &h->hugepage_freelists[nid]);
+
+ /*
+ * If someone was waiting for the zero to
+ * finish, wake them up.
+ */
+ if (waitqueue_active(&h->dqzero_wait[nid]))
+ wake_up(&h->dqzero_wait[nid]);
+ nr_zerod++;
+ freelist = &h->hugepage_freelists[nid];
+ }
+ spin_unlock_irq(&hugetlb_lock);
+}
+
+static ssize_t zeroable_hugepages_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t len)
+{
+ unsigned int nr_zero;
+ struct hstate *h;
+ int err;
+ int nid;
+
+ if (!strcmp(buf, "max") || !strcmp(buf, "max\n")) {
+ nr_zero = UINT_MAX;
+ } else {
+ err = kstrtouint(buf, 10, &nr_zero);
+ if (err)
+ return err;
+ }
+ h = kobj_to_hstate(kobj, &nid);
+
+ zero_free_hugepages_nid(h, nid, nr_zero);
+
+ return len;
+}
+HSTATE_ATTR(zeroable_hugepages);
+
/*
* A subset of global hstate attributes for node devices
*/
@@ -359,6 +478,7 @@ static struct attribute *per_node_hstate_attrs[] = {
&nr_hugepages_attr.attr,
&free_hugepages_attr.attr,
&surplus_hugepages_attr.attr,
+ &zeroable_hugepages_attr.attr,
NULL,
};
--
2.20.1

On Thu, Dec 25, 2025 at 12:22 AM 李喆 <lizhe.67@bytedance.com> wrote:
>
> [...]
> +static void zero_free_hugepages_nid(struct hstate *h,
> + int nid, unsigned int nr_zero)
> +{
> + struct list_head *freelist = &h->hugepage_freelists[nid];
> + unsigned int nr_zerod = 0;
> + struct folio *folio;
> +
> + if (zero_should_abort(h, nid))
> + return;
> +
> + spin_lock_irq(&hugetlb_lock);
> +
> + while (nr_zerod < nr_zero) {
> +
> + if (zero_should_abort(h, nid) || fatal_signal_pending(current))
> + break;
> +
> + freelist = freelist->prev;
> + if (unlikely(list_is_head(freelist, &h->hugepage_freelists[nid])))
> + break;
> + folio = list_entry(freelist, struct folio, lru);
> +
> + if (folio_test_hugetlb_zeroed(folio) ||
> + folio_test_hugetlb_zeroing(folio))
> + continue;
> +
> + folio_set_hugetlb_zeroing(folio);
> +
> + /*
> + * Incrementing this here is a bit of a fib, since
> + * the page hasn't been cleared yet (it will be done
> + * immediately after dropping the lock below). But
> + * it keeps the count consistent with the overall
> + * free count in case the page gets taken off the
> + * freelist while we're working on it.
> + */
> + h->free_huge_pages_zero_node[nid]++;
> + spin_unlock_irq(&hugetlb_lock);
> +
> + /*
> + * HWPoison pages may show up on the freelist.
> + * Don't try to zero it out, but do set the flag
> + * and counts, so that we don't consider it again.
> + */
> + if (!folio_test_hwpoison(folio))
> + folio_zero_user(folio, 0);
> +
> + cond_resched();
> +
> + spin_lock_irq(&hugetlb_lock);
> + folio_set_hugetlb_zeroed(folio);
> + folio_clear_hugetlb_zeroing(folio);
> +
> + /*
> + * If the page is still on the free list, move
> + * it to the head.
> + */
> + if (folio_test_hugetlb_freed(folio))
> + list_move(&folio->lru, &h->hugepage_freelists[nid]);
> +
> + /*
> + * If someone was waiting for the zero to
> + * finish, wake them up.
> + */
> + if (waitqueue_active(&h->dqzero_wait[nid]))
> + wake_up(&h->dqzero_wait[nid]);
> + nr_zerod++;
> + freelist = &h->hugepage_freelists[nid];
> + }
> + spin_unlock_irq(&hugetlb_lock);
> +}

Nit: s/nr_zerod/nr_zeroed/
Feels like the list logic can be cleaned up a bit here. Since the
zeroed folios are at the head of the list, and the dirty ones at the
tail, and you start walking from the tail, you don't need to check if
you circled back to the head - just stop if you encounter a prezeroed
folio. If you encounter a prezeroed folio while walking from the tail,
that means that all other folios from that one to the head will also
be prezeroed already.
- Frank

On Fri, 26 Dec 2025 10:51:01 -0800, fvdl@google.com wrote:
> > [...]
>
> Nit: s/nr_zerod/nr_zeroed/

Thank you for the reminder. I will address this issue in v2.
> Feels like the list logic can be cleaned up a bit here. Since the
> zeroed folios are at the head of the list, and the dirty ones at the
> tail, and you start walking from the tail, you don't need to check if
> you circled back to the head - just stop if you encounter a prezeroed
> folio. If you encounter a prezeroed folio while walking from the tail,
> that means that all other folios from that one to the head will also
> be prezeroed already.

Thank you for the thoughtful suggestion. Your line of reasoning is,
in most situations, perfectly valid. Under extreme concurrency,
however, a corner case can still appear. Imagine two processes
simultaneously zeroing huge pages: Process A enters
zero_free_hugepages_nid(), completes the zeroing of one huge page,
and marks the folio in the list as pre-zeroed. Should Process B enter
the same function moments later and decide to exit as soon as it
meets a prezeroed folio, the intended parallel zeroing would quietly
fall back to a single-threaded pace.
Thanks,
Zhe

On Mon, Dec 29, 2025 at 4:26 AM Li Zhe <lizhe.67@bytedance.com> wrote:
>
> On Fri, 26 Dec 2025 10:51:01 -0800, fvdl@google.com wrote:
>
> > > [...]
> >
> > Nit: s/nr_zerod/nr_zeroed/
>
> Thank you for the reminder. I will address this issue in v2.
>
> > Feels like the list logic can be cleaned up a bit here. Since the
> > zeroed folios are at the head of the list, and the dirty ones at the
> > tail, and you start walking from the tail, you don't need to check if
> > you circled back to the head - just stop if you encounter a prezeroed
> > folio. If you encounter a prezeroed folio while walking from the tail,
> > that means that all other folios from that one to the head will also
> > be prezeroed already.
>
> Thank you for the thoughtful suggestion. Your line of reasoning is,
> in most situations, perfectly valid. Under extreme concurrency,
> however, a corner case can still appear. Imagine two processes
> simultaneously zeroing huge pages: Process A enters
> zero_free_hugepages_nid(), completes the zeroing of one huge page,
> and marks the folio in the list as pre-zeroed. Should Process B enter
> the same function moments later and decide to exit as soon as it
> meets a prezeroed folio, the intended parallel zeroing would quietly
> fall back to a single-threaded pace.

Hm, setting the prezeroed bit and moving the folio to the front of the
free list happens while holding hugetlb_lock. In other words, if you
encounter a folio with the prezeroed bit set while holding
hugetlb_lock, it will always be in a contiguous stretch of prezeroed
folios at the head of the free list.
Since the check for 'is this already prezeroed' is done while holding
hugetlb_lock, you know for sure that the folio is part of a list of
prezeroed folios at the head, and you can stop, right?
- Frank

On Mon, 29 Dec 2025 10:57:23 -0800, fvdl@google.com wrote:
>
> [...]
>
> Hm, setting the prezeroed bit and moving the folio to the front of the
> free list happens while holding hugetlb_lock. In other words, if you
> encounter a folio with the prezeroed bit set while holding
> hugetlb_lock, it will always be in a contiguous stretch of prezeroed
> folios at the head of the free list.
>
> Since the check for 'is this already prezeroed' is done while holding
> hugetlb_lock, you know for sure that the folio is part of a list of
> prezeroed folios at the head, and you can stop, right?

Sorry for the confusion earlier. You're right, this does make
zero_free_hugepages_nid() simpler. I'll update it in v2.
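
Roughly, I expect the v2 loop to look something like the following
(untested sketch, reusing only the helpers already introduced by this
series):

    /* Untested v2 sketch: stop at the first prezeroed folio. */
    static void zero_free_hugepages_nid(struct hstate *h, int nid,
                                        unsigned int nr_zero)
    {
        struct list_head *freelist = &h->hugepage_freelists[nid];
        unsigned int nr_zeroed = 0;
        struct folio *folio;

        if (zero_should_abort(h, nid))
            return;

        spin_lock_irq(&hugetlb_lock);
        while (nr_zeroed < nr_zero) {
            if (zero_should_abort(h, nid) || fatal_signal_pending(current))
                break;

            folio = list_entry(freelist->prev, struct folio, lru);

            /*
             * Prezeroed folios are moved to the head while holding
             * hugetlb_lock, so the first one seen from the tail marks
             * a contiguous prezeroed stretch; nothing beyond it is
             * left to do.
             */
            if (folio_test_hugetlb_zeroed(folio))
                break;

            /* Another zeroing process owns this folio; step past it. */
            if (folio_test_hugetlb_zeroing(folio)) {
                freelist = freelist->prev;
                continue;
            }

            folio_set_hugetlb_zeroing(folio);
            h->free_huge_pages_zero_node[nid]++;
            spin_unlock_irq(&hugetlb_lock);

            if (!folio_test_hwpoison(folio))
                folio_zero_user(folio, 0);

            cond_resched();

            spin_lock_irq(&hugetlb_lock);
            folio_set_hugetlb_zeroed(folio);
            folio_clear_hugetlb_zeroing(folio);
            if (folio_test_hugetlb_freed(folio))
                list_move(&folio->lru, &h->hugepage_freelists[nid]);
            if (waitqueue_active(&h->dqzero_wait[nid]))
                wake_up(&h->dqzero_wait[nid]);

            nr_zeroed++;
            freelist = &h->hugepage_freelists[nid];
        }
        spin_unlock_irq(&hugetlb_lock);
    }
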
Thanks,
Zhe