From nobody Sat Feb 7 07:24:38 2026
From: Ryan Roberts
To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    Brendan Jackman, Johannes Weiner, Zi Yan, Uladzislau Rezki,
    "Vishal Moola (Oracle)"
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v1 1/2] mm/page_alloc: Optimize free_contig_range()
Date: Mon, 5 Jan 2026 16:17:37 +0000
Message-ID: <20260105161741.3952456-2-ryan.roberts@arm.com>
In-Reply-To: <20260105161741.3952456-1-ryan.roberts@arm.com>
References: <20260105161741.3952456-1-ryan.roberts@arm.com>

Decompose the range of order-0 pages to be freed into the largest possible
power-of-2 sized and aligned chunks, and free those chunks to the pcp or
buddy. This improves on the previous approach, which freed each order-0 page
individually in a loop. Testing shows performance improvements of more than
10x in some cases.

Since each page is order-0, we must decrement each page's reference count
individually and only consider the page for freeing as part of a high order
chunk if the reference count goes to zero. Additionally, free_pages_prepare()
must be called for each individual order-0 page too, so that the struct page
state and global accounting state can be appropriately managed. But once this
is done, the resulting high order chunks can be freed as a unit to the pcp or
buddy.
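As an illustration for reviewers, the chunking described above can be shown
with a minimal, standalone sketch (not part of the patch). The order cap of 9
and the example pfn/length are assumptions for illustration only; the patch
itself uses pageblock_order and the kernel's ilog2()/__ffs()/min3() helpers:

#include <stdio.h>

#define CHUNK_ORDER_CAP 9	/* illustrative stand-in for pageblock_order */

/* Naive integer log2, sufficient for this sketch. */
static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Decompose [pfn, pfn + nr_pages) into largest aligned power-of-2 chunks. */
static void show_chunks(unsigned long pfn, unsigned long nr_pages)
{
	while (nr_pages) {
		unsigned int fit_order = ilog2_ul(nr_pages);
		unsigned int align_order = pfn ? __builtin_ctzl(pfn) : fit_order;
		unsigned int order = fit_order;

		if (align_order < order)
			order = align_order;
		if (CHUNK_ORDER_CAP < order)
			order = CHUNK_ORDER_CAP;

		printf("free pfn %lu as order-%u (%lu pages)\n",
		       pfn, order, 1UL << order);
		pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}
}

int main(void)
{
	/* A misaligned 16-page run at pfn 5 splits as 1 + 2 + 8 + 4 + 1. */
	show_chunks(5, 16);
	return 0;
}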
This approach significantly speeds up the free operation, but also has the
side benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available in
high orders.

vmalloc will shortly become a user of this new, optimized free_contig_range()
since it aggressively allocates high order non-compound pages, but then calls
split_page() to end up with contiguous order-0 pages. These can now be freed
much more efficiently.

The execution time of the following function was measured in a VM on an Apple
M2 system:

static int page_alloc_high_order_test(void)
{
	unsigned int order = HPAGE_PMD_ORDER;
	struct page *page;
	int i;

	for (i = 0; i < 100000; i++) {
		page = alloc_pages(GFP_KERNEL, order);
		if (!page)
			return -1;
		split_page(page, order);
		free_contig_range(page_to_pfn(page), 1UL << order);
	}

	return 0;
}

Execution time before: 1684366 usec
Execution time after:   136216 usec

Perf trace before:

  60.93%  0.00%  kthreadd  [kernel.kallsyms]  [k] ret_from_fork
          |
          ---ret_from_fork
             kthread
             0xffffbba283e63980
             |
             |--60.01%--0xffffbba283e636dc
             |          |
             |          |--58.57%--free_contig_range
             |          |          |
             |          |          |--57.19%--___free_pages
             |          |          |          |
             |          |          |          |--46.65%--__free_frozen_pages
             |          |          |          |          |
             |          |          |          |          |--28.08%--free_pcppages_bulk
             |          |          |          |          |
             |          |          |          |           --12.05%--free_frozen_page_commit.constprop.0
             |          |          |          |
             |          |          |          |--5.10%--__get_pfnblock_flags_mask.isra.0
             |          |          |          |
             |          |          |          |--1.13%--_raw_spin_unlock
             |          |          |          |
             |          |          |          |--0.78%--free_frozen_page_commit.constprop.0
             |          |          |          |
             |          |          |           --0.75%--_raw_spin_trylock
             |          |          |
             |          |           --0.95%--__free_frozen_pages
             |          |
             |           --1.44%--___free_pages
             |
              --0.78%--0xffffbba283e636c0
                        split_page

Perf trace after:

  10.62%  0.00%  kthreadd  [kernel.kallsyms]  [k] ret_from_fork
          |
          ---ret_from_fork
             kthread
             0xffffbbd55ef74980
             |
             |--8.74%--0xffffbbd55ef746dc
             |          free_contig_range
             |          |
             |           --8.72%--__free_contig_range
             |
              --1.56%--0xffffbbd55ef746c0
                        |
                         --1.54%--split_page

Signed-off-by: Ryan Roberts
---
 include/linux/gfp.h |   1 +
 mm/page_alloc.c     | 116 +++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 106 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b155929af5b1..3ed0bef34d0c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -439,6 +439,7 @@ extern struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_
 #define alloc_contig_pages(...)	alloc_hooks(alloc_contig_pages_noprof(__VA_ARGS__))
 #endif
 
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages);
 void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 
 #ifdef CONFIG_CONTIG_ALLOC
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a045d728ae0f..1015c8edf8a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -91,6 +91,9 @@ typedef int __bitwise fpi_t;
 /* Free the page without taking locks. Rely on trylock only. */
 #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
 
+/* free_pages_prepare() has already been called for page(s) being freed. */
+#define FPI_PREPARED		((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1582,8 +1585,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	unsigned long pfn = page_to_pfn(page);
 	struct zone *zone = page_zone(page);
 
-	if (free_pages_prepare(page, order))
-		free_one_page(zone, page, pfn, order, fpi_flags);
+	if (!(fpi_flags & FPI_PREPARED)) {
+		if (!free_pages_prepare(page, order))
+			return;
+	}
+
+	free_one_page(zone, page, pfn, order, fpi_flags);
 }
 
 void __meminit __free_pages_core(struct page *page, unsigned int order,
@@ -2943,8 +2950,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 		return;
 	}
 
-	if (!free_pages_prepare(page, order))
-		return;
+	if (!(fpi_flags & FPI_PREPARED)) {
+		if (!free_pages_prepare(page, order))
+			return;
+	}
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -7250,9 +7259,99 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 }
 #endif /* CONFIG_CONTIG_ALLOC */
 
+static void free_prepared_contig_range(struct page *page,
+				       unsigned long nr_pages)
+{
+	while (nr_pages) {
+		unsigned int fit_order, align_order, order;
+		unsigned long pfn;
+
+		/*
+		 * Find the largest aligned power-of-2 number of pages that
+		 * starts at the current page, does not exceed nr_pages and is
+		 * less than or equal to pageblock_order.
+		 */
+		pfn = page_to_pfn(page);
+		fit_order = ilog2(nr_pages);
+		align_order = pfn ? __ffs(pfn) : fit_order;
+		order = min3(fit_order, align_order, pageblock_order);
+
+		/*
+		 * Free the chunk as a single block. Our caller has already
+		 * called free_pages_prepare() for each order-0 page.
+		 */
+		__free_frozen_pages(page, order, FPI_PREPARED);
+
+		page += 1UL << order;
+		nr_pages -= 1UL << order;
+	}
+}
+
+/**
+ * __free_contig_range - Free a contiguous range of order-0 pages.
+ * @pfn: Page frame number of the first page in the range.
+ * @nr_pages: Number of pages to free.
+ *
+ * For each order-0 struct page in the physically contiguous range, put a
+ * reference. Free any page whose reference count falls to zero. The
+ * implementation is functionally equivalent to, but significantly faster
+ * than, calling __free_page() for each struct page in a loop.
+ *
+ * Memory allocated with alloc_pages(order>=1) then subsequently split to
+ * order-0 with split_page() is an example of appropriate contiguous pages
+ * that can be freed with this API.
+ *
+ * Returns the number of pages which were not freed because their reference
+ * count did not fall to zero.
+ *
+ * Context: May be called in interrupt context or while holding a normal
+ * spinlock, but not in NMI context or while holding a raw spinlock.
+ */
+unsigned long __free_contig_range(unsigned long pfn, unsigned long nr_pages)
+{
+	struct page *page = pfn_to_page(pfn);
+	unsigned long not_freed = 0;
+	struct page *start = NULL;
+	unsigned long i;
+	bool can_free;
+
+	/*
+	 * Chunk the range into contiguous runs of pages for which the refcount
+	 * went to zero and for which free_pages_prepare() succeeded. If
+	 * free_pages_prepare() fails we consider the page to have been freed
+	 * and deliberately leak it.
+	 *
+	 * Code assumes contiguous PFNs have contiguous struct pages, but not
+	 * vice versa.
+	 */
+	for (i = 0; i < nr_pages; i++, page++) {
+		VM_BUG_ON_PAGE(PageHead(page), page);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+
+		can_free = put_page_testzero(page);
+		if (!can_free)
+			not_freed++;
+		else if (!free_pages_prepare(page, 0))
+			can_free = false;
+
+		if (!can_free && start) {
+			free_prepared_contig_range(start, page - start);
+			start = NULL;
+		} else if (can_free && !start) {
+			start = page;
+		}
+	}
+
+	if (start)
+		free_prepared_contig_range(start, page - start);
+
+	return not_freed;
+}
+EXPORT_SYMBOL(__free_contig_range);
+
 void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 {
-	unsigned long count = 0;
+	unsigned long count;
 	struct folio *folio = pfn_folio(pfn);
 
 	if (folio_test_large(folio)) {
@@ -7266,12 +7365,7 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 		return;
 	}
 
-	for (; nr_pages--; pfn++) {
-		struct page *page = pfn_to_page(pfn);
-
-		count += page_count(page) != 1;
-		__free_page(page);
-	}
+	count = __free_contig_range(pfn, nr_pages);
 	WARN(count != 0, "%lu pages are still in use!\n", count);
 }
 EXPORT_SYMBOL(free_contig_range);
-- 
2.43.0

From nobody Sat Feb 7 07:24:38 2026
From: Ryan Roberts
Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Brendan Jackman , Johannes Weiner , Zi Yan , Uladzislau Rezki , "Vishal Moola (Oracle)" Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH v1 2/2] vmalloc: Optimize vfree Date: Mon, 5 Jan 2026 16:17:38 +0000 Message-ID: <20260105161741.3952456-3-ryan.roberts@arm.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20260105161741.3952456-1-ryan.roberts@arm.com> References: <20260105161741.3952456-1-ryan.roberts@arm.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it must immediately split_page() to order-0 so that it remains compatible with users that want to access the underlying struct page. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") recently made it much more likely for vmalloc to allocate high order pages which are subsequently split to order-0. Unfortunately this had the side effect of causing performance regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko benchmarks). See Closes: tag. This happens because the high order pages must be gotten from the buddy but then because they are split to order-0, when they are freed they are freed to the order-0 pcp. Previously allocation was for order-0 pages so they were recycled from the pcp. It would be preferable if when vmalloc allocates an (e.g.) order-3 page that it also frees that order-3 page to the order-3 pcp, then the regression could be removed. So let's do exactly that; use the new __free_contig_range() API to batch-free contiguous ranges of pfns. This not only removes the regression, but significantly improves performance of vfree beyond the baseline. A selection of test_vmalloc benchmarks running on AWS m7g.metal (arm64) system. v6.18 is the baseline. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") was added in v6.19-rc1 where we see regressions. Then with this change performance is much better. 
A selection of test_vmalloc benchmark results, running on an AWS m7g.metal
(arm64) system. v6.18 is the baseline. Commit a06157804399 ("mm/vmalloc:
request large order pages from buddy allocator") was added in v6.19-rc1,
where we see the regressions; with this change performance is much better
(>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement):

+----------------------------------------------------------+-------------+-------------+
| test_vmalloc benchmark                                    |   v6.19-rc1 |   v6.19-rc1 |
|                                                           |             |    + change |
+==========================================================+=============+=============+
| fix_align_alloc_test: p:1, h:0, l:500000 (usec)          | (R) -40.69% |   (I) 4.85% |
| fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |       0.10% |      -1.04% |
| fix_size_alloc_test: p:4, h:0, l:500000 (usec)           | (R) -22.74% |  (I) 14.12% |
| fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | (R) -23.63% |  (I) 43.81% |
| fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |      -1.58% | (I) 102.28% |
| fix_size_alloc_test: p:64, h:0, l:100000 (usec)          | (R) -24.39% |  (I) 89.64% |
| fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |   (I) 2.34% | (I) 181.42% |
| fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | (R) -23.29% | (I) 111.05% |
| fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |   (I) 3.74% | (I) 213.52% |
| fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | (R) -23.80% | (I) 118.28% |
| fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |  (R) -2.84% | (I) 427.65% |
| full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |       2.74% |      -1.12% |
| kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |       0.58% |      -0.79% |
| kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |      -0.66% |      -0.91% |
| long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | (R) -25.24% |  (I) 70.62% |
| pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |      -0.58% |      -1.27% |
| random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | (R) -45.75% |  (I) 11.11% |
| random_size_alloc_test: p:1, h:0, l:500000 (usec)        | (R) -28.16% |  (I) 59.47% |
| vm_map_ram_test: p:1, h:0, l:500000 (usec)               |      -0.54% |      -0.85% |
+----------------------------------------------------------+-------------+-------------+

Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
Signed-off-by: Ryan Roberts
---
 mm/vmalloc.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 32d6ee92d4ff..86407178b6d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3434,7 +3434,8 @@ void vfree_atomic(const void *addr)
 void vfree(const void *addr)
 {
 	struct vm_struct *vm;
-	int i;
+	unsigned long start_pfn;
+	int i, nr;
 
 	if (unlikely(in_interrupt())) {
 		vfree_atomic(addr);
@@ -3460,17 +3461,25 @@ void vfree(const void *addr)
 	/* All pages of vm should be charged to same memcg, so use first one. */
 	if (vm->nr_pages && !(vm->flags & VM_MAP_PUT_PAGES))
 		mod_memcg_page_state(vm->pages[0], MEMCG_VMALLOC, -vm->nr_pages);
-	for (i = 0; i < vm->nr_pages; i++) {
-		struct page *page = vm->pages[i];
 
-		BUG_ON(!page);
-		/*
-		 * High-order allocs for huge vmallocs are split, so
-		 * can be freed as an array of order-0 allocations
-		 */
-		__free_page(page);
-		cond_resched();
+	if (vm->nr_pages) {
+		start_pfn = page_to_pfn(vm->pages[0]);
+		nr = 1;
+		for (i = 1; i < vm->nr_pages; i++) {
+			unsigned long pfn = page_to_pfn(vm->pages[i]);
+
+			if (start_pfn + nr != pfn) {
+				__free_contig_range(start_pfn, nr);
+				start_pfn = pfn;
+				nr = 1;
+				cond_resched();
+			} else {
+				nr++;
+			}
+		}
+		__free_contig_range(start_pfn, nr);
 	}
+
 	if (!(vm->flags & VM_MAP_PUT_PAGES))
 		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
 	kvfree(vm->pages);
-- 
2.43.0