From: Gregory Price <gourry@gourry.net>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com, akpm@linux-foundation.org,
    vbabka@suse.cz, surenb@google.com, mhocko@suse.com, jackmanb@google.com,
    hannes@cmpxchg.org, ziy@nvidia.com, richard.weiyang@gmail.com,
    David Hildenbrand
Subject: [PATCH v7] page_alloc: allow migration of smaller hugepages during contig_alloc
Date: Sun, 21 Dec 2025 07:46:56 -0500
Message-ID: <20251221124656.2362540-1-gourry@gourry.net>

We presently skip regions with hugepages entirely when trying to do
contiguous page allocation. This causes otherwise-movable 2MB HugeTLB
pages to be considered unmovable, and makes 1GB gigantic page allocation
less reliable on systems utilizing both.

Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages
when allocating 1G pages") skipped all HugeTLB-containing regions because
migrating them can cause significant delays in 1GB allocation (HugeTLB
migrations may fail for a number of reasons).

Instead, if hugepage migration is enabled, consider regions with hugepages
smaller than the target contiguous allocation request as valid targets for
allocation.

We optimize for the existing behavior by searching for non-hugetlb regions
in a first pass, then retrying the search to include hugetlb regions only
on failure. This keeps the existing fast path as the default case, with a
slow-path fallback to increase reliability.

We only fall back to the slow path if a hugetlb region was detected, and
we do a full re-scan because the zones/blocks may have changed during the
first pass (and it's not worth further complexity).

isolate_migratepages_block() has similar hugetlb filter logic, and the
hugetlb code does a migratability check in folio_isolate_hugetlb() during
isolation. The code servicing the allocation and migration already
supports this exact use case.

To test, allocate a number of 2MB HugeTLB pages (in this case 48GB worth)
and then attempt to allocate some 1GB HugeTLB pages (in this case 4GB
worth). Scale to your machine's memory capacity.

echo 24576 > .../hugepages-2048kB/nr_hugepages
echo 4 > .../hugepages-1048576kB/nr_hugepages

Prior to this patch, the 1GB page reservation can fail if no free
contiguous 1GB ranges remain. After this patch, the kernel will try to
move the 2MB pages and successfully allocate the 1GB pages (assuming
sufficient overall memory is available).
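For reference, a fleshed-out sketch of the recipe above, assuming the
standard sysfs layout under /sys/kernel/mm/hugepages/ (the sizes are
illustrative; scale to your machine):

echo 24576 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
echo 4     > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# Confirm how many 1GB pages were actually reserved
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages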
Also tested this while a program had the 2MB reservations mapped, and the
1GB reservation still succeeds.

folio_alloc_gigantic() is the primary user of alloc_contig_pages(); the
other users are debug or init-time allocations and are largely unaffected:

- ppc/memtrace is a debugfs interface
- x86/tdx memory allocation occurs once on module init
- kfence/core happens once at module (late) init
- THP uses it in debug_vm_pgtable_alloc_huge_page() at __init time

Suggested-by: David Hildenbrand
Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/
Signed-off-by: Gregory Price
Reviewed-by: Zi Yan
Reviewed-by: Wei Yang
Acked-by: David Hildenbrand (Red Hat)
Acked-by: Michal Hocko
---
 mm/page_alloc.c | 59 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..c3054e40b95b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7083,7 +7083,8 @@ static int __alloc_contig_pages(unsigned long start_pfn,
 }
 
 static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
-				   unsigned long nr_pages)
+				   unsigned long nr_pages, bool skip_hugetlb,
+				   bool *skipped_hugetlb)
 {
 	unsigned long i, end_pfn = start_pfn + nr_pages;
 	struct page *page;
@@ -7099,8 +7100,42 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
 		if (PageReserved(page))
 			return false;
 
-		if (PageHuge(page))
-			return false;
+		/*
+		 * Only consider ranges containing hugepages if those pages are
+		 * smaller than the requested contiguous region. e.g.:
+		 *     Move 2MB pages to free up a 1GB range.
+		 *     Don't move 1GB pages to free up a 2MB range.
+		 *
+		 * This makes contiguous allocation more reliable if multiple
+		 * hugepage sizes are used without causing needless movement.
+		 */
+		if (PageHuge(page)) {
+			unsigned int order;
+
+			if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
+				return false;
+
+			if (skip_hugetlb) {
+				*skipped_hugetlb = true;
+				return false;
+			}
+
+			page = compound_head(page);
+			order = compound_order(page);
+			if ((order >= MAX_FOLIO_ORDER) ||
+			    (nr_pages <= (1 << order)))
+				return false;
+
+			/*
+			 * Reaching this point means we've encountered a huge page
+			 * smaller than nr_pages, skip all pfn's for that page.
+			 *
+			 * We can't get here from a tail-PageHuge, as it implies
+			 * we started a scan in the middle of a hugepage larger
+			 * than nr_pages - which the prior check filters for.
+			 */
+			i += (1 << order) - 1;
+		}
 	}
 	return true;
 }
@@ -7143,7 +7178,10 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
+	bool skip_hugetlb = true;
+	bool skipped_hugetlb = false;
 
+retry:
 	zonelist = node_zonelist(nid, gfp_mask);
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(gfp_mask), nodemask) {
@@ -7151,7 +7189,9 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 
 		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
 		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
+			if (pfn_range_valid_contig(zone, pfn, nr_pages,
+						   skip_hugetlb,
+						   &skipped_hugetlb)) {
 				/*
 				 * We release the zone lock here because
 				 * alloc_contig_range() will also lock the zone
@@ -7170,6 +7210,17 @@ struct page *alloc_contig_pages_noprof(unsigned long nr_pages, gfp_t gfp_mask,
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
+	/*
+	 * If we failed, retry the search, but treat regions with HugeTLB pages
+	 * as valid targets. This retains fast allocations on the first pass
+	 * without trying to migrate HugeTLB pages (which may fail). On the
+	 * second pass, we will try moving HugeTLB pages when those pages are
+	 * smaller than the requested contiguous region size.
+	 */
+	if (skip_hugetlb && skipped_hugetlb) {
+		skip_hugetlb = false;
+		goto retry;
+	}
 	return NULL;
 }
 #endif /* CONFIG_CONTIG_ALLOC */
-- 
2.52.0
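
An illustrative aside on the size check in pfn_range_valid_contig() (the
numbers assume a typical x86-64 configuration with 4KiB base pages; not
part of the patch itself):

  1GB request:        nr_pages = 1GB / 4KiB = 262144
  2MB HugeTLB folio:  order 9,  1 << 9  = 512;    512 < 262144, so the range remains a candidate
  1GB HugeTLB folio:  order 18, 1 << 18 = 262144; 262144 is not < 262144, so the range is rejected

Only strictly smaller hugepages are considered for migration, matching the
"move 2MB pages to free up a 1GB range" intent.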