Date: Wed, 11 Feb 2026 16:37:18 -0800
Message-ID: <3803e96be57ab3201ab967ba47af22d12024f9e1.1770854662.git.ackerleytng@google.com>
Subject: [RFC PATCH v1 7/7] mm: hugetlb: Refactor out hugetlb_alloc_folio()
From: Ackerley Tng <ackerleytng@google.com>
To: akpm@linux-foundation.org, dan.j.williams@intel.com, david@kernel.org,
	fvdl@google.com, hannes@cmpxchg.org, jgg@nvidia.com,
	jiaqiyan@google.com, jthoughton@google.com, kalyazin@amazon.com,
	mhocko@kernel.org, michael.roth@amd.com, muchun.song@linux.dev,
	osalvador@suse.de, pasha.tatashin@soleen.com, pbonzini@redhat.com,
	peterx@redhat.com, pratyush@kernel.org, rick.p.edgecombe@intel.com,
	rientjes@google.com, roman.gushchin@linux.dev, seanjc@google.com,
	shakeel.butt@linux.dev, shivankg@amd.com, vannapurve@google.com,
	yan.y.zhao@intel.com
Cc: ackerleytng@google.com, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

Refactor hugetlb_alloc_folio() out of alloc_hugetlb_folio(). The new
helper handles allocation of the folio itself together with charging
memory and HugeTLB usage to cgroups. Besides flags to control charging,
hugetlb_alloc_folio() takes parameters for the memory policy to apply
and the memcg to charge memory to.

This refactoring decouples HugeTLB page allocation from VMAs. Until
now, allocation was tied to VMAs in the following ways:

1. Reservations (as in resv_map) are stored in the vma.
2. The mpol is stored at vma->vm_policy.
3. A vma must be used for allocation even if the pages are not meant
   to be used by the host process.

With this coupling removed, VMAs are no longer a requirement for
allocation. This opens up the allocation routine for use without VMAs,
which will allow guest_memfd to use HugeTLB as a more generic allocator
of huge pages, since guest_memfd memory may have no associated VMAs by
design. In addition, direct allocations from HugeTLB could then be
refactored to avoid the use of a pseudo-VMA.

This also decouples HugeTLB page allocation from HugeTLBfs, where the
subpool is stored on the fs mount. That is a requirement for
guest_memfd as well, where the plan is to create a subpool per fd and
store it on the inode.

No functional change intended.
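To make the intended VMA-less usage concrete, the sketch below shows
roughly how a caller like guest_memfd might use the new helper. It is
illustrative only and not part of this patch: gmem_alloc_hugetlb_folio()
is a hypothetical name, and the caller is assumed to supply its own
mempolicy reference and node hints (e.g. from a per-inode policy) rather
than deriving them from a VMA via huge_node():

	/*
	 * Hypothetical sketch, not part of this patch: allocate a hugetlb
	 * folio without any VMA. The caller is assumed to hold a reference
	 * on @mpol, e.g. looked up from a per-inode shared policy.
	 */
	static struct folio *gmem_alloc_hugetlb_folio(struct hstate *h,
						      struct mempolicy *mpol)
	{
		struct mem_cgroup *memcg = get_mem_cgroup_from_current();
		struct folio *folio;

		/*
		 * Without a VMA there is no resv_map, so no existing hstate
		 * reservation can be consumed; charge hugetlb cgroup
		 * reservations instead, mirroring the map_chg != 0 case in
		 * alloc_hugetlb_folio().
		 */
		folio = hugetlb_alloc_folio(h, mpol, numa_node_id(), NULL,
					    memcg,
					    true /* charge_hugetlb_rsvd */,
					    false /* use_existing_reservation */);

		mem_cgroup_put(memcg);

		return folio;
	}

On success, such a caller would then debit its own subpool and attach it
with hugetlb_set_folio_subpool(), since the helper deliberately leaves
subpool handling to its callers.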
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h |  11 +++
 mm/hugetlb.c            | 201 +++++++++++++++++++++++-----------------
 2 files changed, 126 insertions(+), 86 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e51b8ef0cebd9..e385945c04af0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -704,6 +704,9 @@ bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 void wait_for_freed_hugetlb_folios(void);
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
+		int nid, nodemask_t *nodemask, struct mem_cgroup *memcg,
+		bool charge_hugetlb_rsvd, bool use_existing_reservation);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				  unsigned long addr, bool cow_from_owner);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
@@ -1115,6 +1118,14 @@ static inline void wait_for_freed_hugetlb_folios(void)
 {
 }
 
+static inline struct folio *hugetlb_alloc_folio(struct hstate *h,
+		struct mempolicy *mpol, int nid, nodemask_t *nodemask,
+		struct mem_cgroup *memcg, bool charge_hugetlb_rsvd,
+		bool use_existing_reservation)
+{
+	return NULL;
+}
+
 static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 						unsigned long addr,
 						bool cow_from_owner)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 70e91edc47dc1..c6cfb268a527a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2844,6 +2844,105 @@ void wait_for_freed_hugetlb_folios(void)
 	flush_work(&free_hpage_work);
 }
 
+/**
+ * hugetlb_alloc_folio() - Allocates a hugetlb folio.
+ *
+ * @h: struct hstate to allocate from.
+ * @mpol: struct mempolicy to apply for this folio allocation.
+ *        Caller must hold a reference on mpol.
+ * @nid: Node id, used together with mpol to determine folio allocation.
+ * @nodemask: Nodemask, used together with mpol to determine folio allocation.
+ * @memcg: Memory cgroup to charge for memory usage.
+ *         Caller must hold a reference on memcg.
+ * @charge_hugetlb_rsvd: Set to true to charge hugetlb reservations in the cgroup.
+ * @use_existing_reservation: Set to true if this allocation should use an
+ *                            existing hstate reservation.
+ *
+ * This function handles cgroup and global hstate reservations. VMA-related
+ * reservations and subpool debiting must be handled by the caller if necessary.
+ *
+ * Return: folio on success, or an ERR_PTR-encoded error on failure.
+ */
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
+		int nid, nodemask_t *nodemask, struct mem_cgroup *memcg,
+		bool charge_hugetlb_rsvd, bool use_existing_reservation)
+{
+	size_t nr_pages = pages_per_huge_page(h);
+	struct hugetlb_cgroup *h_cg = NULL;
+	gfp_t gfp = htlb_alloc_mask(h);
+	bool memory_charged = false;
+	int idx = hstate_index(h);
+	struct folio *folio;
+	int ret;
+
+	if (charge_hugetlb_rsvd) {
+		if (hugetlb_cgroup_charge_cgroup_rsvd(idx, nr_pages, &h_cg))
+			return ERR_PTR(-ENOSPC);
+	}
+
+	if (hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg)) {
+		ret = -ENOSPC;
+		goto out_uncharge_hugetlb_page_count;
+	}
+
+	ret = mem_cgroup_hugetlb_try_charge(memcg, gfp | __GFP_RETRY_MAYFAIL,
+					    nr_pages);
+	if (ret == -ENOMEM)
+		goto out_uncharge_memory;
+
+	memory_charged = !ret;
+
+	spin_lock_irq(&hugetlb_lock);
+
+	folio = NULL;
+	if (use_existing_reservation || available_huge_pages(h))
+		folio = dequeue_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
+
+	if (!folio) {
+		spin_unlock_irq(&hugetlb_lock);
+		folio = alloc_buddy_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
+		if (!folio) {
+			ret = -ENOSPC;
+			goto out_uncharge_memory;
+		}
+		spin_lock_irq(&hugetlb_lock);
+		list_add(&folio->lru, &h->hugepage_activelist);
+		folio_ref_unfreeze(folio, 1);
+		/* Fall through */
+	}
+
+	if (use_existing_reservation) {
+		folio_set_hugetlb_restore_reserve(folio);
+		h->resv_huge_pages--;
+	}
+
+	hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, folio);
+
+	if (charge_hugetlb_rsvd)
+		hugetlb_cgroup_commit_charge_rsvd(idx, nr_pages, h_cg, folio);
+
+	spin_unlock_irq(&hugetlb_lock);
+
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, nr_pages);
+
+	if (memory_charged)
+		mem_cgroup_commit_charge(folio, memcg);
+
+	return folio;
+
+out_uncharge_memory:
+	if (memory_charged)
+		mem_cgroup_cancel_charge(memcg, nr_pages);
+
+	hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
+
+out_uncharge_hugetlb_page_count:
+	if (charge_hugetlb_rsvd)
+		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, h_cg);
+
+	return ERR_PTR(ret);
+}
+
 typedef enum {
 	/*
 	 * For either 0/1: we checked the per-vma resv map, and one resv
@@ -2878,17 +2977,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	struct folio *folio;
 	long retval, gbl_chg, gbl_reserve;
 	map_chg_state map_chg;
-	int ret, idx;
-	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h);
-	bool memory_charged = false;
+	bool charge_hugetlb_rsvd;
+	bool use_existing_reservation;
 	struct mem_cgroup *memcg;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
 	int nid;
 
-	idx = hstate_index(h);
-
 	/* Whether we need a separate per-vma reservation? */
 	if (cow_from_owner) {
 		/*
@@ -2920,7 +3016,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	if (map_chg) {
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
 		if (gbl_chg < 0) {
-			ret = -ENOSPC;
+			folio = ERR_PTR(-ENOSPC);
 			goto out_end_reservation;
 		}
 	} else {
@@ -2935,85 +3031,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 * If this allocation is not consuming a per-vma reservation,
 	 * charge the hugetlb cgroup now.
 	 */
-	if (map_chg) {
-		ret = hugetlb_cgroup_charge_cgroup_rsvd(
-			idx, pages_per_huge_page(h), &h_cg);
-		if (ret) {
-			ret = -ENOSPC;
-			goto out_subpool_put;
-		}
-	}
+	charge_hugetlb_rsvd = (bool)map_chg;
 
-	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
-	if (ret) {
-		ret = -ENOSPC;
-		goto out_uncharge_cgroup_reservation;
-	}
+	/*
+	 * gbl_chg == 0 indicates a reservation exists for the allocation, so
+	 * try to use it.
+	 */
+	use_existing_reservation = gbl_chg == 0;
 
 	memcg = get_mem_cgroup_from_current();
-	ret = mem_cgroup_hugetlb_try_charge(memcg, gfp | __GFP_RETRY_MAYFAIL,
-					    pages_per_huge_page(h));
-	if (ret == -ENOMEM)
-		goto out_put_memcg;
-
-	memory_charged = !ret;
-
-	spin_lock_irq(&hugetlb_lock);
 
 	/* Takes reference on mpol. */
 	nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
 
-	/*
-	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
-	 * try dequeuing a page. If there are available_huge_pages(), try using
-	 * them!
-	 */
-	folio = NULL;
-	if (!gbl_chg || available_huge_pages(h))
-		folio = dequeue_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
-
-	if (!folio) {
-		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_buddy_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
-		if (!folio) {
-			mpol_cond_put(mpol);
-			ret = -ENOSPC;
-			goto out_uncharge_memory;
-		}
-		spin_lock_irq(&hugetlb_lock);
-		list_add(&folio->lru, &h->hugepage_activelist);
-		folio_ref_unfreeze(folio, 1);
-		/* Fall through */
-	}
+	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, memcg,
+				    charge_hugetlb_rsvd,
+				    use_existing_reservation);
 
 	mpol_cond_put(mpol);
 
-	/*
-	 * Either dequeued or buddy-allocated folio needs to add special
-	 * mark to the folio when it consumes a global reservation.
-	 */
-	if (!gbl_chg) {
-		folio_set_hugetlb_restore_reserve(folio);
-		h->resv_huge_pages--;
-	}
-
-	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-	/* If allocation is not consuming a reservation, also store the
-	 * hugetlb_cgroup pointer on the page.
-	 */
-	if (map_chg) {
-		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
-						  h_cg, folio);
-	}
-
-	spin_unlock_irq(&hugetlb_lock);
-
-	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
-
-	if (memory_charged)
-		mem_cgroup_commit_charge(folio, memcg);
 	mem_cgroup_put(memcg);
 
+	if (IS_ERR(folio))
+		goto out_subpool_put;
+
 	hugetlb_set_folio_subpool(folio, spool);
 
 	if (map_chg != MAP_CHG_ENFORCED) {
@@ -3046,17 +3087,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	return folio;
 
-out_uncharge_memory:
-	if (memory_charged)
-		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
-out_put_memcg:
-	mem_cgroup_put(memcg);
-
-	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
-out_uncharge_cgroup_reservation:
-	if (map_chg)
-		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
-						    h_cg);
 out_subpool_put:
 	/*
 	 * put page to subpool iff the quota of subpool's rsv_hpages is used
@@ -3067,11 +3097,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		hugetlb_acct_memory(h, -gbl_reserve);
 	}
 
-
 out_end_reservation:
 	if (map_chg != MAP_CHG_ENFORCED)
 		vma_end_reservation(h, vma, addr);
-	return ERR_PTR(ret);
+	return folio;
 }
 
 static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)
-- 
2.53.0.310.g728cabbaf7-goog