From nobody Tue Dec 16 14:49:54 2025
From: Kiryl Shutsemau
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Oscar Salvador, Mike Rapoport, Vlastimil Babka,
 Lorenzo Stoakes, Matthew Wilcox, Zi Yan, Baoquan He, Michal Hocko,
 Johannes Weiner, Jonathan Corbet, Usama Arif, kernel-team@meta.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 Kiryl Shutsemau
Subject: [PATCH 05/11] mm/hugetlb: Refactor code around vmemmap_walk
Date: Fri, 5 Dec 2025 19:43:41 +0000
Message-ID: <20251205194351.1646318-6-kas@kernel.org>
X-Mailer: git-send-email 2.51.2
In-Reply-To: <20251205194351.1646318-1-kas@kernel.org>
References: <20251205194351.1646318-1-kas@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

To prepare for removing fake head pages, the vmemmap_walk code is being
reworked.

The reuse_page and reuse_addr variables are being eliminated. There will
no longer be an expectation regarding the reuse address in relation to
the operated range. Instead, the caller will provide head and tail
vmemmap pages, along with the vmemmap_start address where the head page
is located.

Currently, vmemmap_head and vmemmap_tail are set to the same page, but
this will change in the future.

The only functional change is that __hugetlb_vmemmap_optimize_folio()
will abandon optimization if memory allocation fails.

Signed-off-by: Kiryl Shutsemau
---
 mm/hugetlb_vmemmap.c | 184 ++++++++++++++++++-------------------------
 1 file changed, 77 insertions(+), 107 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba0fb1b6a5a8..f5ee499b8563 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -24,8 +24,9 @@
  *
  * @remap_pte:		called for each lowest-level entry (PTE).
  * @nr_walked:		the number of walked pte.
- * @reuse_page:		the page which is reused for the tail vmemmap pages.
- * @reuse_addr:		the virtual address of the @reuse_page page.
+ * @vmemmap_start:	the start of vmemmap range, where head page is located
+ * @vmemmap_head:	the page to be installed as first in the vmemmap range
+ * @vmemmap_tail:	the page to be installed as non-first in the vmemmap range
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
  *			or is mapped from.
  * @flags:		used to modify behavior in vmemmap page table walking
@@ -34,11 +35,14 @@
 struct vmemmap_remap_walk {
 	void			(*remap_pte)(pte_t *pte, unsigned long addr,
 					     struct vmemmap_remap_walk *walk);
+
 	unsigned long		nr_walked;
-	struct page		*reuse_page;
-	unsigned long		reuse_addr;
+	unsigned long		vmemmap_start;
+	struct page		*vmemmap_head;
+	struct page		*vmemmap_tail;
 	struct list_head	*vmemmap_pages;
 
+
 /* Skip the TLB flush when we split the PMD */
 #define VMEMMAP_SPLIT_NO_TLB_FLUSH	BIT(0)
 /* Skip the TLB flush when we remap the PTE */
@@ -140,14 +144,7 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
 {
 	struct vmemmap_remap_walk *vmemmap_walk = walk->private;
 
-	/*
-	 * The reuse_page is found 'first' in page table walking before
-	 * starting remapping.
-	 */
-	if (!vmemmap_walk->reuse_page)
-		vmemmap_walk->reuse_page = pte_page(ptep_get(pte));
-	else
-		vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
+	vmemmap_walk->remap_pte(pte, addr, vmemmap_walk);
 	vmemmap_walk->nr_walked++;
 
 	return 0;
@@ -207,18 +204,12 @@ static void free_vmemmap_page_list(struct list_head *list)
 static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 			      struct vmemmap_remap_walk *walk)
 {
-	/*
-	 * Remap the tail pages as read-only to catch illegal write operation
-	 * to the tail pages.
-	 */
-	pgprot_t pgprot = PAGE_KERNEL_RO;
 	struct page *page = pte_page(ptep_get(pte));
 	pte_t entry;
 
 	/* Remapping the head page requires r/w */
-	if (unlikely(addr == walk->reuse_addr)) {
-		pgprot = PAGE_KERNEL;
-		list_del(&walk->reuse_page->lru);
+	if (unlikely(addr == walk->vmemmap_start)) {
+		list_del(&walk->vmemmap_head->lru);
 
 		/*
 		 * Makes sure that preceding stores to the page contents from
@@ -226,9 +217,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 		 * write.
 		 */
 		smp_wmb();
+
+		entry = mk_pte(walk->vmemmap_head, PAGE_KERNEL);
+	} else {
+		/*
+		 * Remap the tail pages as read-only to catch illegal write
+		 * operation to the tail pages.
+		 */
+		entry = mk_pte(walk->vmemmap_tail, PAGE_KERNEL_RO);
 	}
 
-	entry = mk_pte(walk->reuse_page, pgprot);
 	list_add(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
@@ -255,16 +253,13 @@ static inline void reset_struct_pages(struct page *start)
 static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 				struct vmemmap_remap_walk *walk)
 {
-	pgprot_t pgprot = PAGE_KERNEL;
 	struct page *page;
 	void *to;
 
-	BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
-
 	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
 	list_del(&page->lru);
 	to = page_to_virt(page);
-	copy_page(to, (void *)walk->reuse_addr);
+	copy_page(to, (void *)walk->vmemmap_start);
 	reset_struct_pages(to);
 
 	/*
@@ -272,7 +267,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 	 * before the set_pte_at() write.
 	 */
 	smp_wmb();
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
 }
 
 /**
@@ -282,22 +277,17 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
  *		to remap.
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
- * @reuse:	reuse address.
- *
  * Return: %0 on success, negative error code otherwise.
  */
-static int vmemmap_remap_split(unsigned long start, unsigned long end,
-			       unsigned long reuse)
+static int vmemmap_remap_split(unsigned long start, unsigned long end)
 {
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= NULL,
+		.vmemmap_start	= start,
 		.flags		= VMEMMAP_SPLIT_NO_TLB_FLUSH,
 	};
 
-	/* See the comment in the vmemmap_remap_free(). */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
-	return vmemmap_remap_range(reuse, end, &walk);
+	return vmemmap_remap_range(start, end, &walk);
 }
 
 /**
@@ -308,7 +298,8 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
  *		to remap.
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
- * @reuse:	reuse address.
+ * @vmemmap_head: the page to be installed as first in the vmemmap range
+ * @vmemmap_tail: the page to be installed as non-first in the vmemmap range
  * @vmemmap_pages:	list to deposit vmemmap pages to be freed.  It is callers
  *		responsibility to free pages.
  * @flags:	modifications to vmemmap_remap_walk flags
@@ -316,69 +307,40 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_free(unsigned long start, unsigned long end,
-			      unsigned long reuse,
+			      struct page *vmemmap_head,
+			      struct page *vmemmap_tail,
 			      struct list_head *vmemmap_pages,
 			      unsigned long flags)
 {
 	int ret;
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_remap_pte,
-		.reuse_addr	= reuse,
+		.vmemmap_start	= start,
+		.vmemmap_head	= vmemmap_head,
+		.vmemmap_tail	= vmemmap_tail,
 		.vmemmap_pages	= vmemmap_pages,
 		.flags		= flags,
 	};
-	int nid = page_to_nid((struct page *)reuse);
-	gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+
+	ret = vmemmap_remap_range(start, end, &walk);
+	if (!ret || !walk.nr_walked)
+		return ret;
+
+	end = start + walk.nr_walked * PAGE_SIZE;
 
 	/*
-	 * Allocate a new head vmemmap page to avoid breaking a contiguous
-	 * block of struct page memory when freeing it back to page allocator
-	 * in free_vmemmap_page_list(). This will allow the likely contiguous
-	 * struct page backing memory to be kept contiguous and allowing for
-	 * more allocations of hugepages. Fallback to the currently
-	 * mapped head page in case should it fail to allocate.
+	 * vmemmap_pages contains pages from the previous vmemmap_remap_range()
+	 * call which failed.  These are pages which were removed from
+	 * the vmemmap. They will be restored in the following call.
 	 */
-	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
-	if (walk.reuse_page) {
-		copy_page(page_to_virt(walk.reuse_page),
-			  (void *)walk.reuse_addr);
-		list_add(&walk.reuse_page->lru, vmemmap_pages);
-		memmap_pages_add(1);
-	}
+	walk = (struct vmemmap_remap_walk) {
+		.remap_pte	= vmemmap_restore_pte,
+		.vmemmap_start	= start,
+		.vmemmap_pages	= vmemmap_pages,
+		.flags		= 0,
+	};
 
-	/*
-	 * In order to make remapping routine most efficient for the huge pages,
-	 * the routine of vmemmap page table walking has the following rules
-	 * (see more details from the vmemmap_pte_range()):
-	 *
-	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
-	 *   should be continuous.
-	 * - The @reuse address is part of the range [@reuse, @end) that we are
-	 *   walking which is passed to vmemmap_remap_range().
-	 * - The @reuse address is the first in the complete range.
-	 *
-	 * So we need to make sure that @start and @reuse meet the above rules.
-	 */
-	BUG_ON(start - reuse != PAGE_SIZE);
-
-	ret = vmemmap_remap_range(reuse, end, &walk);
-	if (ret && walk.nr_walked) {
-		end = reuse + walk.nr_walked * PAGE_SIZE;
-		/*
-		 * vmemmap_pages contains pages from the previous
-		 * vmemmap_remap_range call which failed.  These
-		 * are pages which were removed from the vmemmap.
-		 * They will be restored in the following call.
-		 */
-		walk = (struct vmemmap_remap_walk) {
-			.remap_pte	= vmemmap_restore_pte,
-			.reuse_addr	= reuse,
-			.vmemmap_pages	= vmemmap_pages,
-			.flags		= 0,
-		};
-
-		vmemmap_remap_range(reuse, end, &walk);
-	}
+	vmemmap_remap_range(start + PAGE_SIZE, end, &walk);
 
 	return ret;
 }
@@ -415,29 +377,27 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
  *		to remap.
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
- * @reuse:	reuse address.
  * @flags:	modifications to vmemmap_remap_walk flags
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			       unsigned long reuse, unsigned long flags)
+			       unsigned long flags)
 {
 	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_restore_pte,
-		.reuse_addr	= reuse,
+		.vmemmap_start	= start,
 		.vmemmap_pages	= &vmemmap_pages,
 		.flags		= flags,
 	};
 
-	/* See the comment in the vmemmap_remap_free(). */
-	BUG_ON(start - reuse != PAGE_SIZE);
+	start += HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
 		return -ENOMEM;
 
-	return vmemmap_remap_range(reuse, end, &walk);
+	return vmemmap_remap_range(start, end, &walk);
 }
 
 DEFINE_STATIC_KEY_FALSE(hugetlb_optimize_vmemmap_key);
@@ -454,8 +414,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 					   struct folio *folio, unsigned long flags)
 {
 	int ret;
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -466,9 +425,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	if (flags & VMEMMAP_SYNCHRONIZE_RCU)
 		synchronize_rcu();
 
+	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse	= vmemmap_start;
-	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
 	 * The pages which the vmemmap virtual address range [@vmemmap_start,
@@ -477,7 +435,7 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
 	 * When a HugeTLB page is freed to the buddy allocator, previously
 	 * discarded vmemmap pages must be allocated and remapping.
 	 */
-	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
+	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, flags);
 	if (!ret) {
 		folio_clear_hugetlb_vmemmap_optimized(folio);
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -565,9 +523,9 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 					    struct list_head *vmemmap_pages,
 					    unsigned long flags)
 {
-	int ret = 0;
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
+	struct page *vmemmap_head, *vmemmap_tail;
+	int nid, ret = 0;
 
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
@@ -592,9 +550,21 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	 */
 	folio_set_hugetlb_vmemmap_optimized(folio);
 
+	nid = folio_nid(folio);
+	vmemmap_head = alloc_pages_node(nid, GFP_KERNEL, 0);
+
+	if (!vmemmap_head) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	copy_page(page_to_virt(vmemmap_head), folio);
+	list_add(&vmemmap_head->lru, vmemmap_pages);
+	memmap_pages_add(1);
+
+	vmemmap_tail	= vmemmap_head;
+	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse	= vmemmap_start;
-	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
 	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
@@ -602,8 +572,10 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
 	 * mapping the range to vmemmap_pages list so that they can be freed by
 	 * the caller.
 	 */
-	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
+	ret = vmemmap_remap_free(vmemmap_start, vmemmap_end,
+				 vmemmap_head, vmemmap_tail,
 				 vmemmap_pages, flags);
+out:
 	if (ret) {
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);
 		folio_clear_hugetlb_vmemmap_optimized(folio);
@@ -632,21 +604,19 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 
 static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *folio)
 {
-	unsigned long vmemmap_start = (unsigned long)&folio->page, vmemmap_end;
-	unsigned long vmemmap_reuse;
+	unsigned long vmemmap_start, vmemmap_end;
 
 	if (!vmemmap_should_optimize_folio(h, folio))
 		return 0;
 
+	vmemmap_start	= (unsigned long)folio;
 	vmemmap_end	= vmemmap_start + hugetlb_vmemmap_size(h);
-	vmemmap_reuse	= vmemmap_start;
-	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
 
 	/*
 	 * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
 	 * @vmemmap_end]
 	 */
-	return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+	return vmemmap_remap_split(vmemmap_start, vmemmap_end);
 }
 
 static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
-- 
2.51.2