From nobody Mon May 25 04:33:38 2026
Received: from out-182.mta0.migadu.com (out-182.mta0.migadu.com
 [91.218.175.182])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id A07C6229B12
	for <linux-kernel@vger.kernel.org>; Tue, 19 May 2026 01:25:48 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=91.218.175.182
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779153951; cv=none;
 b=q140eFRGTgFBYJJHBqx2UhxLl2N+YF+mvhCPZdz0/rimIjOErQO0BCl2fvF3h+MLL2NWNsDOSZ86LvhegQ8lgrgpHE/yfNt7J0oiC8PGHB5O12TMjvowENtlQFQqW17vC5R5jvNqsuN3AyamyN7+FZlQN0bWgAfyV7OBsdzIUpc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779153951; c=relaxed/simple;
	bh=pjM0WtJxIExEXvpkR2i7tjsUeXGWbhPUaJ3+29q6kLQ=;
	h=From:To:Cc:Subject:Date:Message-ID:MIME-Version;
 b=d7+rfymcvZCminl1idrNAvI2sEFhmUiYDafyPwYK7bZ3gwDFLTHFsaBg8+n1S2M6b+nOdyqzBE1GSqGuWshRwg/3upVPkDDYKR/F1mIgUxFKEzKZI50VdGTR5CpM/o0K0WcmwEY7PODnoYKfSuXz1xWPqEtK6U39k3hwUaTnOGA=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.dev;
 spf=pass smtp.mailfrom=linux.dev;
 dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev
 header.b=Zf8ofgbU; arc=none smtp.client-ip=91.218.175.182
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linux.dev
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linux.dev
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev
 header.b="Zf8ofgbU"
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and
 include these headers.
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1779153946;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding;
	bh=/6vG17en364be4oVrc/061pLrqYUD/S8MrGyGe4hKh4=;
	b=Zf8ofgbU/udYmUnVEAipdw/vDGqJ5paYw532aB40krWQ13PhkAD0tdfoitZ0qbairzlwiA
	XHv0XbX2GtcCX/3Zkqkb3lNgGFu3xc0I97XBwS8M4vlkoGg+8cG171wTOcTlWnogUyUwXh
	+INff1Y+khK3iY6kVEOutIw9Mu5UpQw=
From: "JP Kobryn (Meta)" <jp.kobryn@linux.dev>
To: akpm@linux-foundation.org,
	vbabka@kernel.org,
	surenb@google.com,
	mhocko@suse.com,
	jackmanb@google.com,
	hannes@cmpxchg.org,
	ziy@nvidia.com,
	linux-mm@kvack.org
Cc: usama.arif@linux.dev,
	kirill@shutemov.name,
	willy@infradead.org,
	linux-kernel@vger.kernel.org,
	kernel-team@meta.com
Subject: [PATCH] mm/page_alloc: skip high atomic reservation at or below
 costly order
Date: Mon, 18 May 2026 18:25:32 -0700
Message-ID: <20260519012532.272770-1-jp.kobryn@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT
Content-Type: text/plain; charset="utf-8"

We're seeing a pattern in production where 2MB THP order-9 allocations are
failing due to fragmentation and triggering reclaim on systems with plenty
of free memory. Over time, the success rate of these THP allocations does
not improve at all.

Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
indicated the given zone had sufficient free pages for order-9 allocations,
yet they were going unused. Drilling down into the zone and inspecting
/proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
zone's HighAtomic bucket (while zero were present in Movable). THP is
unable to draw blocks from HighAtomic since that bucket is not in the
fallback list.

The heuristic for reserving pageblocks in HighAtomic is that any atomic
allocation greater than order-0 will result in the full pageblock being
captured. This means that an order-1 atomic allocation (2 pages) will
over-reserve by 256x, taking a full 512-page pageblock.
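
The 256x figure can be checked with a small userspace sketch. It is
illustrative only and not part of the patch: the helper names and the
EXAMPLE_PAGEBLOCK_ORDER constant are hypothetical, and the order-9
pageblock size is assumed from the 2MB THP case described above.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed for illustration: order-9 pageblocks (512 base pages). */
#define EXAMPLE_PAGEBLOCK_ORDER 9U

/* Number of base pages in a block of the given order. */
unsigned long pages_in_order(unsigned int order)
{
	return 1UL << order;
}

/* How many times larger a full pageblock is than the request. */
unsigned long over_reserve_factor(unsigned int order)
{
	assert(order <= EXAMPLE_PAGEBLOCK_ORDER);
	return pages_in_order(EXAMPLE_PAGEBLOCK_ORDER) / pages_in_order(order);
}

/* Print the factor for the atomic orders seen in this workload (1-3). */
void print_over_reserve_table(void)
{
	for (unsigned int order = 1; order <= 3; order++)
		printf("order-%u: %lu pages requested, %lux over-reserved\n",
		       order, pages_in_order(order),
		       over_reserve_factor(order));
}
```

Under these assumptions an order-1 request works out to 256x, order-2 to
128x, and order-3 to 64x.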

Gate the reservation on order. Skip for allocations at or below
PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
reserving entire pageblocks, and significantly helps when THP is in use on
a fragmented but otherwise healthy system.
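
As a rough standalone model of the gate (a sketch, not the kernel code
itself): highatomic_reserve_allowed() and its self-check are hypothetical
names used only for illustration, while the value 3 mirrors the kernel's
PAGE_ALLOC_COSTLY_ORDER definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the kernel's definition in include/linux/mmzone.h. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper: true when an atomic allocation of this order
 * would still be allowed to reserve a pageblock with the gate applied.
 */
bool highatomic_reserve_allowed(unsigned int order)
{
	return order > PAGE_ALLOC_COSTLY_ORDER;
}

/* Usage: orders 1-3 are skipped, order 4 and up may still reserve. */
void highatomic_gate_selfcheck(void)
{
	assert(!highatomic_reserve_allowed(1));
	assert(!highatomic_reserve_allowed(3));
	assert(highatomic_reserve_allowed(4));
}
```

In the patch itself the same comparison is an early return at the top of
reserve_highatomic_pageblock(), so the existing reservation limits are
unchanged for orders above the threshold.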

Testing was performed using an A/B Instagram workload receiving prod
traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
several gains:

Unpatched
HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
  ...all order-9 blocks in HighAtomic
THP success rate: 1-6%
Compaction success rate: 0-2%
pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
Atomic order-4+ allocations: 0

Patched
HighAtomic pageblocks per host: 1
THP success rate: 44-78%
Compaction success rate: 24-47%
pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
Atomic order-4+ allocations: 0

Note that for this workload all atomic allocations were order 0-3,
originating from the network stack, btrfs, and the scheduler.

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
 mm/page_alloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e262d1316259d..45d8f6844f510 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3446,6 +3446,13 @@ static void reserve_highatomic_pageblock(struct page=
 *page, int order,
 	int mt;
 	unsigned long max_managed;
=20
+	/*
+	 * Don't reserve a pageblock for lower orders.
+	 * Order 1-3 allocs should not capture a huge page size block.
+	 */
+	if (order <=3D PAGE_ALLOC_COSTLY_ORDER)
+		return;
+
 	/*
 	 * The number reserved as: minimum is 1 pageblock, maximum is
 	 * roughly 1% of a zone. But if 1% of a zone falls below a
--=20
2.53.0-Meta