From: Gregory Price <gourry@gourry.net>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com,
	akpm@linux-foundation.org, vbabka@suse.cz, surenb@google.com,
	mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org,
	ziy@nvidia.com, richard.weiyang@gmail.com, osalvador@suse.de,
	rientjes@google.com, david@redhat.com, joshua.hahnjy@gmail.com,
	fvdl@google.com
Subject: [PATCH v6] page_alloc: allow migration of smaller hugepages during contig_alloc
Date: Thu, 18 Dec 2025 18:38:04 -0500
Message-ID: <20251218233804.1395835-1-gourry@gourry.net>

We presently skip regions containing hugepages entirely when trying to
do contiguous page allocation. This causes otherwise-movable 2MB
HugeTLB pages to be considered unmovable, and makes 1GB gigantic page
allocation less reliable on systems using both sizes.

Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages
when allocating 1G pages") skipped all regions containing hugepages
because migrating them can cause significant delays in 1G allocation
(HugeTLB migrations may fail for a number of reasons).

Instead, if hugepage migration is enabled, consider regions containing
hugepages smaller than the target contiguous allocation request as
valid targets for allocation.

We optimize for the existing behavior by searching for non-hugetlb
regions in a first pass, then retrying the search including hugetlb
regions only on failure. This keeps the existing fast path as the
default case, with a slow-path fallback to increase reliability. We
only fall back to the slow path if a hugetlb region was detected, and
we do a full re-scan because the zones/blocks may have changed during
the first pass (and handling that is not worth further complexity).

isolate_migratepages_block() has similar hugetlb filter logic, and the
hugetlb code does a migratability check in folio_isolate_hugetlb()
during isolation. The code servicing the allocation and migration
already supports this exact use case.

To test, allocate a number of 2MB HugeTLB pages (in this case 48GB)
and then attempt to allocate some 1GB HugeTLB pages (in this case 4GB).
Scale to your machine's memory capacity:

  echo 24576 > .../hugepages-2048kB/nr_hugepages
  echo 4 > .../hugepages-1048576kB/nr_hugepages

Prior to this patch, the 1GB page reservation can fail if no contiguous
1GB regions remain. After this patch, the kernel will try to move the
2MB pages and successfully allocate the 1GB pages (assuming sufficient
overall memory is available).
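For illustration, a minimal userspace sketch of the size filter
described above (a hypothetical helper, not the kernel code; the
max_folio_order value of 18 is a stand-in for MAX_FOLIO_ORDER and
4KB base pages are assumed):

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * A hugepage only blocks a contiguous request when it is at least
   * as large as the request itself, or above the folio-order limit.
   * "order" is log2 of the hugepage size in base pages.
   */
  static bool hugepage_blocks_request(unsigned int order,
                                      unsigned long nr_pages,
                                      unsigned int max_folio_order)
  {
          if (order >= max_folio_order)
                  return true;
          return nr_pages <= (1UL << order);
  }

  int main(void)
  {
          /* 2MB page (order 9) vs 1GB request (262144 base pages): movable. */
          printf("%d\n", hugepage_blocks_request(9, 262144, 18));  /* 0 */
          /* 1GB page (order 18) vs 2MB request (512 base pages): skip. */
          printf("%d\n", hugepage_blocks_request(18, 512, 18));    /* 1 */
          return 0;
  }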
This was also tested while a program had the 2MB reservations mapped;
the 1GB reservation still succeeds.

folio_alloc_gigantic() is the primary user of alloc_contig_pages(); the
other users are debug or init-time allocations and are largely
unaffected:

- ppc/memtrace is a debugfs interface
- x86/tdx memory allocation occurs once at module init
- kfence/core happens once at (late) module init
- THP uses it in debug_vm_pgtable_alloc_huge_page() at __init time

Suggested-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c | 52 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..adf579a0df3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7083,7 +7083,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
 }
 
 static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
-				   unsigned long nr_pages)
+				   unsigned long nr_pages, bool skip_hugetlb,
+				   bool *skipped_hugetlb)
 {
 	unsigned long i, end_pfn = start_pfn + nr_pages;
 	struct page *page;
@@ -7099,8 +7100,35 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
 		if (PageReserved(page))
 			return false;
 
-		if (PageHuge(page))
-			return false;
+		/*
+		 * Only consider ranges containing hugepages if those pages are
+		 * smaller than the requested contiguous region. e.g.:
+		 *   Move 2MB pages to free up a 1GB range.
+		 *   Don't move 1GB pages to free up a 2MB range.
+		 *
+		 * This makes contiguous allocation more reliable if multiple
+		 * hugepage sizes are used without causing needless movement.
+		 */
+		if (PageHuge(page)) {
+			unsigned int order;
+
+			if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
+				return false;
+
+			if (skip_hugetlb) {
+				*skipped_hugetlb = true;
+				return false;
+			}
+
+			page = compound_head(page);
+			order = compound_order(page);
+			if ((order >= MAX_FOLIO_ORDER) ||
+			    (nr_pages <= (1 << order)))
+				return false;
+
+			/* No need to check the pfns for this page */
+			i += (1 << order) - 1;
+		}
 	}
 	return true;
 }
@@ -7143,7 +7171,10 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
+	bool skip_hugetlb = true;
+	bool skipped_hugetlb = false;
 
+retry:
 	zonelist = node_zonelist(nid, gfp_mask);
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(gfp_mask), nodemask) {
@@ -7151,7 +7182,9 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 
 		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
 		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
+			if (pfn_range_valid_contig(zone, pfn, nr_pages,
+						   skip_hugetlb,
+						   &skipped_hugetlb)) {
 				/*
 				 * We release the zone lock here because
 				 * alloc_contig_range() will also lock the zone
@@ -7170,6 +7203,17 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
+	/*
+	 * If we failed, retry the search, but treat regions with HugeTLB pages
+	 * as valid targets. This retains fast allocations on the first pass
+	 * without trying to migrate HugeTLB pages (which may fail). On the
+	 * second pass, we will try moving HugeTLB pages when those pages are
+	 * smaller than the requested contiguous region size.
+	 */
+	if (skip_hugetlb && skipped_hugetlb) {
+		skip_hugetlb = false;
+		goto retry;
+	}
 	return NULL;
 }
 #endif /* CONFIG_CONTIG_ALLOC */
-- 
2.52.0