From nobody Wed Dec 17 15:35:21 2025
From: Johannes Weiner
To: Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 1/5] mm: compaction: push watermark into compaction_suitable() callers
Date: Thu, 13 Mar 2025 17:05:32 -0400
Message-ID: <20250313210647.1314586-2-hannes@cmpxchg.org>
In-Reply-To: <20250313210647.1314586-1-hannes@cmpxchg.org>
References: <20250313210647.1314586-1-hannes@cmpxchg.org>

compaction_suitable() hardcodes the min watermark, with a boost to the
low watermark for costly orders. However, compaction_ready() requires
order-0 at the high watermark. It currently checks the marks twice.

Make the watermark a parameter to compaction_suitable() and have the
callers pass in what they require:

- compaction_zonelist_suitable() is used by the direct reclaim path,
  so use the min watermark.

- compaction_suit_allocation_order() has a watermark in context derived
  from cc->alloc_flags. The only quirk is that kcompactd doesn't
  initialize cc->alloc_flags explicitly. There is a direct check in
  kcompactd_do_work() that passes ALLOC_WMARK_MIN, but there is another
  check downstack in compact_zone() that ends up passing the unset
  alloc_flags. Since they default to 0, and that coincides with
  ALLOC_WMARK_MIN, it is correct. But it's subtle. Set cc->alloc_flags
  explicitly.

- should_continue_reclaim() is direct reclaim, use the min watermark.

- Finally, consolidate the two checks in compaction_ready() to a
  single compaction_suitable() call passing the high watermark.

There is a tiny change in behavior: before, compaction_suitable()
would check order-0 against min or low, depending on costly order.
Then there'd be another high watermark check. Now, the high watermark
is passed to compaction_suitable(), and the costly order-boost
(low - min) is added on top. This means compaction_ready() sets a
marginally higher target for free pages. In a kernelbuild + THP
pressure test, though, this didn't show any measurable negative
effects on memory pressure or reclaim rates. As the comment above the
check says, reclaim is usually stopped short by
should_continue_reclaim(), and this just defines the worst-case
reclaim cutoff in case compaction is not making any headway.
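To make the new arithmetic concrete: the caller now supplies the base
watermark, and __compaction_suitable() stacks the compaction gap and,
for costly orders, the low-min boost on top. A standalone userspace
sketch with made-up zone values follows; compact_gap() mirrors its
mm/internal.h definition of 2UL << order.

#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER	3

/* Twice the allocation size: room for migration sources and targets */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

int main(void)
{
	unsigned long min_wmark = 1024, low_wmark = 1280;	/* invented marks */
	unsigned int order = 9;					/* x86-64 THP */
	unsigned long watermark = min_wmark;			/* caller's pick */

	watermark += compact_gap(order);
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		watermark += low_wmark - min_wmark;	/* costly-order boost */

	printf("order-0 pages that must be free: %lu\n", watermark);
	return 0;
}

With these invented numbers, the zone needs 1024 + 1024 + 256 = 2304
free base pages before compaction is considered suitable.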
Signed-off-by: Johannes Weiner
---
 include/linux/compaction.h |  5 ++--
 mm/compaction.c            | 52 ++++++++++++++++++++------------------
 mm/vmscan.c                | 26 ++++++++++---------
 3 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7bf0c521db63..173d9c07a895 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -95,7 +95,7 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		struct page **page);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern bool compaction_suitable(struct zone *zone, int order,
-				int highest_zoneidx);
+				unsigned long watermark, int highest_zoneidx);
 
 extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
@@ -113,7 +113,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
 }
 
 static inline bool compaction_suitable(struct zone *zone, int order,
-				       int highest_zoneidx)
+				       unsigned long watermark,
+				       int highest_zoneidx)
 {
 	return false;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 550ce5021807..036353ef1878 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2382,40 +2382,42 @@ static enum compact_result compact_finished(struct compact_control *cc)
 }
 
 static bool __compaction_suitable(struct zone *zone, int order,
-				  int highest_zoneidx,
-				  unsigned long wmark_target)
+				  unsigned long watermark, int highest_zoneidx,
+				  unsigned long free_pages)
 {
-	unsigned long watermark;
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
	 * isolate free pages for migration targets. This means that the
-	 * watermark and alloc_flags have to match, or be more pessimistic than
-	 * the check in __isolate_free_page(). We don't use the direct
-	 * compactor's alloc_flags, as they are not relevant for freepage
-	 * isolation. We however do use the direct compactor's highest_zoneidx
-	 * to skip over zones where lowmem reserves would prevent allocation
-	 * even if compaction succeeds.
-	 * For costly orders, we require low watermark instead of min for
-	 * compaction to proceed to increase its chances.
+	 * watermark have to match, or be more pessimistic than the check in
+	 * __isolate_free_page().
+	 *
+	 * For costly orders, we require a higher watermark for compaction to
+	 * proceed to increase its chances.
+	 *
+	 * We use the direct compactor's highest_zoneidx to skip over zones
+	 * where lowmem reserves would prevent allocation even if compaction
+	 * succeeds.
+	 *
	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
-	 * suitable migration targets
+	 * suitable migration targets.
	 */
-	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
-				low_wmark_pages(zone) : min_wmark_pages(zone);
 	watermark += compact_gap(order);
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		watermark += low_wmark_pages(zone) - min_wmark_pages(zone);
 	return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
-				   ALLOC_CMA, wmark_target);
+				   ALLOC_CMA, free_pages);
 }
 
 /*
  * compaction_suitable: Is this suitable to run compaction on this zone now?
  */
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, unsigned long watermark,
+			 int highest_zoneidx)
 {
 	enum compact_result compact_result;
 	bool suitable;
 
-	suitable = __compaction_suitable(zone, order, highest_zoneidx,
+	suitable = __compaction_suitable(zone, order, watermark, highest_zoneidx,
 					 zone_page_state(zone, NR_FREE_PAGES));
 	/*
	 * fragmentation index determines if allocation failures are due to
@@ -2453,6 +2455,7 @@ bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
 	return suitable;
 }
 
+/* Used by direct reclaimers */
 bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		int alloc_flags)
 {
@@ -2475,8 +2478,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
		 */
		available = zone_reclaimable_pages(zone) / order;
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
-		if (__compaction_suitable(zone, order, ac->highest_zoneidx,
-					  available))
+		if (__compaction_suitable(zone, order, min_wmark_pages(zone),
+					  ac->highest_zoneidx, available))
			return true;
	}
 
@@ -2513,13 +2516,13 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
	    !(alloc_flags & ALLOC_CMA)) {
-		watermark = low_wmark_pages(zone) + compact_gap(order);
-		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
-					 0, zone_page_state(zone, NR_FREE_PAGES)))
+		if (!__zone_watermark_ok(zone, 0, watermark + compact_gap(order),
+					 highest_zoneidx, 0,
+					 zone_page_state(zone, NR_FREE_PAGES)))
			return COMPACT_SKIPPED;
	}
 
-	if (!compaction_suitable(zone, order, highest_zoneidx))
+	if (!compaction_suitable(zone, order, watermark, highest_zoneidx))
		return COMPACT_SKIPPED;
 
	return COMPACT_CONTINUE;
@@ -3082,6 +3085,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
		.mode = MIGRATE_SYNC_LIGHT,
		.ignore_skip_hint = false,
		.gfp_mask = GFP_KERNEL,
+		.alloc_flags = ALLOC_WMARK_MIN,
	};
	enum compact_result ret;
 
@@ -3100,7 +3104,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
			continue;
 
		ret = compaction_suit_allocation_order(zone,
-				cc.order, zoneid, ALLOC_WMARK_MIN,
+				cc.order, zoneid, cc.alloc_flags,
				false);
		if (ret != COMPACT_CONTINUE)
			continue;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bc740637a6c..3370bdca6868 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5890,12 +5890,15 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 
	/* If compaction would go ahead or the allocation would succeed, stop */
	for_each_managed_zone_pgdat(zone, pgdat, z, sc->reclaim_idx) {
+		unsigned long watermark = min_wmark_pages(zone);
+
		/* Allocation can already succeed, nothing to do */
-		if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
+		if (zone_watermark_ok(zone, sc->order, watermark,
				      sc->reclaim_idx, 0))
			return false;
 
-		if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+		if (compaction_suitable(zone, sc->order, watermark,
+					sc->reclaim_idx))
			return false;
	}
 
@@ -6122,22 +6125,21 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
			      sc->reclaim_idx, 0))
		return true;
 
-	/* Compaction cannot yet proceed. Do reclaim. */
-	if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
-		return false;
-
	/*
-	 * Compaction is already possible, but it takes time to run and there
-	 * are potentially other callers using the pages just freed. So proceed
-	 * with reclaim to make a buffer of free pages available to give
-	 * compaction a reasonable chance of completing and allocating the page.
+	 * Direct reclaim usually targets the min watermark, but compaction
+	 * takes time to run and there are potentially other callers using the
+	 * pages just freed. So target a higher buffer to give compaction a
+	 * reasonable chance of completing and allocating the pages.
+	 *
	 * Note that we won't actually reclaim the whole buffer in one attempt
	 * as the target watermark in should_continue_reclaim() is lower. But if
	 * we are already above the high+gap watermark, don't reclaim at all.
	 */
-	watermark = high_wmark_pages(zone) + compact_gap(sc->order);
+	watermark = high_wmark_pages(zone);
+	if (compaction_suitable(zone, sc->order, watermark, sc->reclaim_idx))
+		return true;
 
-	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
+	return false;
 }
 
 static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
-- 
2.48.1
From nobody Wed Dec 17 15:35:21 2025
From: Johannes Weiner
To: Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 2/5] mm: page_alloc: trace type pollution from compaction capturing
Date: Thu, 13 Mar 2025 17:05:33 -0400
Message-ID: <20250313210647.1314586-3-hannes@cmpxchg.org>
In-Reply-To: <20250313210647.1314586-1-hannes@cmpxchg.org>
References: <20250313210647.1314586-1-hannes@cmpxchg.org>

When the page allocator places pages of a certain migratetype into
blocks of another type, it has lasting effects on the ability to
compact and defragment down the line. For improving placement and
compaction, visibility into such events is crucial.

The most common case, allocator fallbacks, is already annotated, but
compaction capturing is also allowed to grab pages of a different
type. Extend the tracepoint to cover this case.
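For a rough picture of what the extended tracepoint reports, here is a
standalone model of the capture-time condition; the migratetype names
mirror the kernel's, everything else is simplified for illustration:

#include <stdio.h>

enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE };

static const char * const mt_name[] = { "unmovable", "movable", "reclaimable" };

/* An extfrag event is of interest whenever the captured page's block
 * type differs from the type the compactor is allocating for. */
static void capture(enum migratetype block_mt, enum migratetype alloc_mt)
{
	if (block_mt != alloc_mt)
		printf("extfrag: %s page captured for %s allocation\n",
		       mt_name[block_mt], mt_name[alloc_mt]);
}

int main(void)
{
	capture(MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE);	/* traced */
	capture(MIGRATE_MOVABLE, MIGRATE_MOVABLE);	/* silent */
	return 0;
}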
Signed-off-by: Johannes Weiner
Acked-by: Zi Yan
---
 mm/page_alloc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9b4a5e6dfee9..6f0404941886 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -614,6 +614,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
 	    capc->cc->migratetype != MIGRATE_MOVABLE)
 		return false;
 
+	if (migratetype != capc->cc->migratetype)
+		trace_mm_page_alloc_extfrag(page, capc->cc->order, order,
+					    capc->cc->migratetype, migratetype);
+
 	capc->page = page;
 	return true;
 }
-- 
2.48.1
From nobody Wed Dec 17 15:35:21 2025
From: Johannes Weiner
To: Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 3/5] mm: page_alloc: defrag_mode
Date: Thu, 13 Mar 2025 17:05:34 -0400
Message-ID: <20250313210647.1314586-4-hannes@cmpxchg.org>
In-Reply-To: <20250313210647.1314586-1-hannes@cmpxchg.org>
References: <20250313210647.1314586-1-hannes@cmpxchg.org>

The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which
may well produce suitable pages. As a result, fragmentation of
physical memory is a common ongoing process in many load scenarios.

Fragmentation deteriorates compaction's ability to produce huge
pages. Depending on the lifetime of the fragmenting allocations, those
effects can be long-lasting or even permanent, requiring drastic
measures like forcible idle states or even reboots as the only
reliable ways to recover the address space for THP production.

In a kernel build test with supplemental THP pressure, the THP
allocation rate steadily declines over 15 runs:

thp_fault_alloc
      61988
      56474
      57258
      50187
      52388
      55409
      52925
      47648
      43669
      40621
      36077
      41721
      36685
      34641
      33215

This is a hurdle in adopting THP in any environment where hosts are
shared between multiple overlapping workloads (cloud environments),
and rarely experience true idle periods. To make THP a reliable and
predictable optimization, there needs to be a stronger guarantee to
avoid such fragmentation.

Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
is enforced on the allocator fastpath and the reclaiming slowpath.
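The knob added below is an ordinary sysctl, so `echo 1 >
/proc/sys/vm/defrag_mode` (or `sysctl vm.defrag_mode=1`) turns it on
at runtime. A minimal sketch of the gating it introduces, with an
invented flag value standing in for the kernel's ALLOC_NOFRAGMENT:

#include <stdio.h>

#define ALLOC_NOFRAGMENT 0x4	/* stand-in bit, not the kernel's value */

static int defrag_mode = 1;	/* vm.defrag_mode */

/*
 * With defrag_mode on, ALLOC_NOFRAGMENT is set unconditionally and is
 * no longer dropped when every zone looks fragmented: the slowpath
 * must exhaust reclaim/compaction before the fallback is permitted.
 */
static unsigned int alloc_flags_sketch(unsigned int alloc_flags)
{
	if (defrag_mode)
		alloc_flags |= ALLOC_NOFRAGMENT;
	return alloc_flags;
}

int main(void)
{
	printf("alloc_flags: %#x\n", alloc_flags_sketch(0));
	return 0;
}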
For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make
it ready for all possible allocation contexts.

The following test results are from a kernel build with periodic
bursts of THP allocations, over 15 runs:

                                       vanilla   defrag_mode=1
@claimer[unmovable]:                       189             103
@claimer[movable]:                          92             103
@claimer[reclaimable]:                     207              61
@pollute[unmovable from movable]:           25               0
@pollute[unmovable from reclaimable]:       28               0
@pollute[movable from unmovable]:        38835               0
@pollute[movable from reclaimable]:     147136               0
@pollute[reclaimable from unmovable]:      178               0
@pollute[reclaimable from movable]:         33               0
@steal[unmovable from movable]:             11               0
@steal[unmovable from reclaimable]:          5               0
@steal[reclaimable from unmovable]:        107               0
@steal[reclaimable from movable]:           90               0
@steal[movable from reclaimable]:          354               0
@steal[movable from unmovable]:            130               0

Both types of polluting fallbacks are eliminated in this workload.

Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with
fallbacks; this allows the native type to group up instead of
spreading out to new blocks.

The assumption in the allocator has been that pollution from movable
allocations is less harmful than from other types, since they can be
reclaimed or migrated out should the space be needed. However, since
fallbacks occur *before* reclaim/compaction is invoked, movable
pollution will still cause non-movable allocations to spread out and
claim more blocks.

Without fragmentation, THP rates hold steady with defrag_mode=1:

thp_fault_alloc
      32478
      20725
      45045
      32130
      14018
      21711
      40791
      29134
      34458
      45381
      28305
      17265
      22584
      28454
      30850

While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla
kernel's to begin with. This is due to deficiencies in how reclaim and
compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
to which smaller allocations are competing with THPs for pageblocks,
while making no effort themselves to reclaim or compact beyond their
own request size. This effect already exists with the current usage of
ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
block stealing much more strongly.

Subsequent patches will address defrag_mode reclaim strategy to raise
the THP success baseline above the vanilla kernel.

Signed-off-by: Johannes Weiner
Reported-by: Brendan Jackman
---
 Documentation/admin-guide/sysctl/vm.rst |  9 +++++++++
 mm/page_alloc.c                         | 27 +++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ec6343ee4248..e169dbf48180 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -29,6 +29,7 @@ files can be found in mm/swap.c.
 - compaction_proactiveness
 - compaction_proactiveness_leeway
 - compact_unevictable_allowed
+- defrag_mode
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -162,6 +163,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
 to compaction, which would block the task from becoming active until the fault
 is resolved.
 
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it occurred, can be long-lasting or even permanent.
 
 dirty_background_bytes
 ======================
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f0404941886..9a02772c2461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
+static int defrag_mode;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -3389,6 +3390,11 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
	 */
	alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
 
+	if (defrag_mode) {
+		alloc_flags |= ALLOC_NOFRAGMENT;
+		return alloc_flags;
+	}
+
 #ifdef CONFIG_ZONE_DMA32
	if (!zone)
		return alloc_flags;
@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
			continue;
		}
 
-		if (no_fallback && nr_online_nodes > 1 &&
+		if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
		    zone != zonelist_zone(ac->preferred_zoneref)) {
			int local_nid;
 
@@ -3591,7 +3597,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
	 * It's possible on a UMA machine to get through all zones that are
	 * fragmented. If avoiding fragmentation, reset and try again.
	 */
-	if (no_fallback) {
+	if (no_fallback && !defrag_mode) {
		alloc_flags &= ~ALLOC_NOFRAGMENT;
		goto retry;
	}
@@ -4128,6 +4134,9 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 
	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
 
+	if (defrag_mode)
+		alloc_flags |= ALLOC_NOFRAGMENT;
+
	return alloc_flags;
 }
 
@@ -4510,6 +4519,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
				 &compaction_retries))
		goto retry;
 
+	/* Reclaim/compaction failed to prevent the fallback */
+	if (defrag_mode) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
 
	/*
	 * Deal with possible cpuset update races or zonelist updates to avoid
@@ -6286,6 +6300,15 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
		.extra1		= SYSCTL_ONE,
		.extra2		= SYSCTL_THREE_THOUSAND,
	},
+	{
+		.procname	= "defrag_mode",
+		.data		= &defrag_mode,
+		.maxlen		= sizeof(defrag_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
	{
		.procname	= "percpu_pagelist_high_fraction",
		.data		= &percpu_pagelist_high_fraction,
-- 
2.48.1
From nobody Wed Dec 17 15:35:21 2025
From: Johannes Weiner
To: Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 4/5] mm: page_alloc: defrag_mode kswapd/kcompactd assistance
Date: Thu, 13 Mar 2025 17:05:35 -0400
Message-ID: <20250313210647.1314586-5-hannes@cmpxchg.org>
In-Reply-To: <20250313210647.1314586-1-hannes@cmpxchg.org>
References: <20250313210647.1314586-1-hannes@cmpxchg.org>

When defrag_mode is enabled, allocation fallbacks strongly prefer
whole block conversions instead of polluting or stealing partially
used blocks. This means there is a demand for pageblocks even from
sub-block requests. Let kswapd/kcompactd help produce them.

By the time kswapd gets woken up, normal rmqueue and block conversion
fallbacks have been attempted and failed. So always wake kswapd with
the block order; it will take care of producing a suitable compaction
gap and then chain-wake kcompactd with the block order when it's done.

                                          VANILLA      DEFRAGMODE-ASYNC
Hugealloc Time mean            52739.45 (  +0.00%)   34300.36 ( -34.96%)
Hugealloc Time stddev          56541.26 (  +0.00%)   36390.42 ( -35.64%)
Kbuild Real time                 197.47 (  +0.00%)     196.13 (  -0.67%)
Kbuild User time                1240.49 (  +0.00%)    1234.74 (  -0.46%)
Kbuild System time                70.08 (  +0.00%)      62.62 ( -10.50%)
THP fault alloc                46727.07 (  +0.00%)   57054.53 ( +22.10%)
THP fault fallback             21910.60 (  +0.00%)   11581.40 ( -47.14%)
Direct compact fail              195.80 (  +0.00%)     107.80 ( -44.72%)
Direct compact success             7.93 (  +0.00%)       4.53 ( -38.06%)
Direct compact success rate %      3.51 (  +0.00%)       3.20 (  -6.89%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 5461033.93 ( +62.07%)
Compact daemon scanned free    5075474.47 ( +0.00%) 5824897.93 ( +14.77%)
Compact direct scanned migrate  161787.27 ( +0.00%)   58336.93 ( -63.94%)
Compact direct scanned free     163467.53 ( +0.00%)   32791.87 ( -79.94%)
Compact total migrate scanned  3531388.53 ( +0.00%) 5519370.87 ( +56.29%)
Compact total free scanned     5238942.00 ( +0.00%) 5857689.80 ( +11.81%)
Alloc stall                      2371.07 (  +0.00%)    2424.60 (  +2.26%)
Pages kswapd scanned           2160926.73 ( +0.00%) 2657018.33 ( +22.96%)
Pages kswapd reclaimed          533191.07 ( +0.00%)  559583.07 (  +4.95%)
Pages direct scanned            400450.33 ( +0.00%)  722094.07 ( +80.32%)
Pages direct reclaimed           94441.73 ( +0.00%)  107257.80 ( +13.57%)
Pages total scanned            2561377.07 ( +0.00%) 3379112.40 ( +31.93%)
Pages total reclaimed           627632.80 ( +0.00%)  666840.87 (  +6.25%)
Swap out                        47959.53 (  +0.00%)   77238.20 ( +61.05%)
Swap in                          7276.00 (  +0.00%)   11712.80 ( +60.97%)
File refaults                  138043.00 (  +0.00%)  143438.80 (  +3.91%)

With this patch, defrag_mode=1 beats the vanilla kernel in THP success
rates and allocation latencies. The trend holds over time:

thp_fault_alloc
      VANILLA   DEFRAGMODE-ASYNC
        61988              52066
        56474              58844
        57258              58233
        50187              58476
        52388              54516
        55409              59938
        52925              57204
        47648              60238
        43669              55733
        40621              56211
        36077              59861
        41721              57771
        36685              58579
        34641              51868
        33215              56280

DEFRAGMODE-ASYNC also wins on %sys as ~3/4 of the direct compaction
work is shifted to kcompactd.

Reclaim activity is higher.
Part of that is simply due to the increased memory footprint from
higher THP use. The other aspect is that *direct* reclaim/compaction
are still going for requested orders rather than targeting the page
blocks required for fallbacks, which is less efficient than it could
be. However, this is already a useful tradeoff to make, as in many
environments peak periods are short and retaining the ability to
produce THP through them is more important.

Signed-off-by: Johannes Weiner
---
 mm/page_alloc.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a02772c2461..4a0d8f871e56 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4076,15 +4076,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 	struct zone *zone;
 	pg_data_t *last_pgdat = NULL;
 	enum zone_type highest_zoneidx = ac->highest_zoneidx;
+	unsigned int reclaim_order;
+
+	if (defrag_mode)
+		reclaim_order = max(order, pageblock_order);
+	else
+		reclaim_order = order;
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, highest_zoneidx,
 					ac->nodemask) {
 		if (!managed_zone(zone))
 			continue;
-		if (last_pgdat != zone->zone_pgdat) {
-			wakeup_kswapd(zone, gfp_mask, order, highest_zoneidx);
-			last_pgdat = zone->zone_pgdat;
-		}
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		wakeup_kswapd(zone, gfp_mask, reclaim_order, highest_zoneidx);
+		last_pgdat = zone->zone_pgdat;
 	}
 }
-- 
2.48.1
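A small sketch of the wake-order policy the patch above introduces,
assuming pageblock_order is 9 (x86-64 with 4K pages); outside
defrag_mode the behavior is unchanged:

#include <stdio.h>

#define PAGEBLOCK_ORDER 9

static unsigned int wake_order(unsigned int order, int defrag_mode)
{
	/* In defrag_mode, even a small request makes kswapd work toward
	 * whole pageblocks; otherwise it reclaims for the request order. */
	if (defrag_mode)
		return order > PAGEBLOCK_ORDER ? order : PAGEBLOCK_ORDER;
	return order;
}

int main(void)
{
	printf("order-3 wake, vanilla:     %u\n", wake_order(3, 0));
	printf("order-3 wake, defrag_mode: %u\n", wake_order(3, 1));
	return 0;
}

So with defrag_mode an order-3 wakeup is treated as an order-9 one,
and kswapd reclaims toward a compaction gap for whole pageblocks.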
From nobody Wed Dec 17 15:35:21 2025
From: Johannes Weiner
To: Andrew Morton
Cc: Vlastimil Babka, Mel Gorman, Zi Yan, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 5/5] mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
Date: Thu, 13 Mar 2025 17:05:36 -0400
Message-ID: <20250313210647.1314586-6-hannes@cmpxchg.org>
In-Reply-To: <20250313210647.1314586-1-hannes@cmpxchg.org>
References: <20250313210647.1314586-1-hannes@cmpxchg.org>

The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls
and THP failure rates are still higher than they could be.

To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change
the watermarking for kswapd & kcompactd: instead of targeting the high
watermark in order-0 pages and checking for one suitable block, simply
require that the high watermark is entirely met in pageblocks.
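Roughly, that means the balance checks count only memory that is free
in whole pageblocks. A userspace model of the implied bookkeeping,
with illustrative constants (the real accounting sits in the freelist
add/move/delete paths of the diff below):

#include <stdio.h>

#define PAGEBLOCK_ORDER 9

static long nr_free_pages;		/* NR_FREE_PAGES */
static long nr_free_pages_blocks;	/* NR_FREE_PAGES_BLOCKS */

static void add_to_free_list(unsigned int order)
{
	long nr_pages = 1L << order;

	nr_free_pages += nr_pages;
	if (order >= PAGEBLOCK_ORDER)	/* a whole block became free */
		nr_free_pages_blocks += nr_pages;
}

int main(void)
{
	add_to_free_list(0);	/* single page: no block credit */
	add_to_free_list(9);	/* full pageblock: counted in both */
	printf("free=%ld in-blocks=%ld\n", nr_free_pages, nr_free_pages_blocks);
	return 0;
}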
To this end, track the number of free pages within contiguous
pageblocks, then change pgdat_balanced() and compact_finished() to
check watermarks against this new value.

This further reduces THP latencies and allocation stalls, and improves
THP success rates against the previous patch:

                               DEFRAGMODE-ASYNC   DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean            34300.36 (  +0.00%)   28904.00 ( -15.73%)
Hugealloc Time stddev          36390.42 (  +0.00%)   33464.37 (  -8.04%)
Kbuild Real time                 196.13 (  +0.00%)     196.59 (  +0.23%)
Kbuild User time                1234.74 (  +0.00%)    1231.67 (  -0.25%)
Kbuild System time                62.62 (  +0.00%)      59.10 (  -5.54%)
THP fault alloc                57054.53 (  +0.00%)   63223.67 ( +10.81%)
THP fault fallback             11581.40 (  +0.00%)    5412.47 ( -53.26%)
Direct compact fail              107.80 (  +0.00%)      59.07 ( -44.79%)
Direct compact success             4.53 (  +0.00%)       2.80 ( -31.33%)
Direct compact success rate %      3.20 (  +0.00%)       3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%)
Compact daemon scanned free    5824897.93 ( +0.00%) 2339773.00 ( -59.83%)
Compact direct scanned migrate   58336.93 ( +0.00%)   47659.93 ( -18.30%)
Compact direct scanned free      32791.87 ( +0.00%)   40729.67 ( +24.21%)
Compact total migrate scanned  5519370.87 ( +0.00%) 2315160.27 ( -58.05%)
Compact total free scanned     5857689.80 ( +0.00%) 2380502.67 ( -59.36%)
Alloc stall                      2424.60 (  +0.00%)     638.87 ( -73.62%)
Pages kswapd scanned           2657018.33 ( +0.00%) 4002186.33 ( +50.63%)
Pages kswapd reclaimed          559583.07 ( +0.00%)  718577.80 ( +28.41%)
Pages direct scanned            722094.07 ( +0.00%)  355172.73 ( -50.81%)
Pages direct reclaimed          107257.80 ( +0.00%)   31162.80 ( -70.95%)
Pages total scanned            3379112.40 ( +0.00%) 4357359.07 ( +28.95%)
Pages total reclaimed           666840.87 ( +0.00%)  749740.60 ( +12.43%)
Swap out                        77238.20 (  +0.00%)  110084.33 ( +42.53%)
Swap in                         11712.80 (  +0.00%)   24457.00 ( +108.80%)
File refaults                  143438.80 (  +0.00%)  188226.93 ( +31.22%)

Also of note is that compaction work overall is reduced. The reason
for this is that when free pageblocks are more readily available,
allocations are also much more likely to get physically placed in LRU
order, instead of being forced to scavenge free space here and
there. This means that reclaim by itself has better chances of freeing
up whole blocks, and the system relies less on compaction.
Comparing all changes to the vanilla kernel:

                                          VANILLA   DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean            52739.45 (  +0.00%)   28904.00 ( -45.19%)
Hugealloc Time stddev          56541.26 (  +0.00%)   33464.37 ( -40.81%)
Kbuild Real time                 197.47 (  +0.00%)     196.59 (  -0.44%)
Kbuild User time                1240.49 (  +0.00%)    1231.67 (  -0.71%)
Kbuild System time                70.08 (  +0.00%)      59.10 ( -15.45%)
THP fault alloc                46727.07 (  +0.00%)   63223.67 ( +35.30%)
THP fault fallback             21910.60 (  +0.00%)    5412.47 ( -75.29%)
Direct compact fail              195.80 (  +0.00%)      59.07 ( -69.48%)
Direct compact success             7.93 (  +0.00%)       2.80 ( -57.46%)
Direct compact success rate %      3.51 (  +0.00%)       3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free    5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate  161787.27 ( +0.00%)   47659.93 ( -70.54%)
Compact direct scanned free     163467.53 ( +0.00%)   40729.67 ( -75.08%)
Compact total migrate scanned  3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned     5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall                      2371.07 (  +0.00%)     638.87 ( -73.02%)
Pages kswapd scanned           2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed          533191.07 ( +0.00%)  718577.80 ( +34.77%)
Pages direct scanned            400450.33 ( +0.00%)  355172.73 ( -11.31%)
Pages direct reclaimed           94441.73 ( +0.00%)   31162.80 ( -67.00%)
Pages total scanned            2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed           627632.80 ( +0.00%)  749740.60 ( +19.46%)
Swap out                        47959.53 (  +0.00%)  110084.33 ( +129.53%)
Swap in                          7276.00 (  +0.00%)   24457.00 ( +236.10%)
File refaults                  138043.00 (  +0.00%)  188226.93 ( +36.35%)

THP allocation latencies and %sys time are down dramatically. THP
allocation failures are down from nearly 50% to 8.5%. And to recall
previous data points, the success rates are steady and reliable
without the cumulative deterioration of fragmentation events.

Compaction work is down overall. Direct compaction work especially is
drastically reduced. As an aside, its success rate of 4% indicates
there is room for improvement. For now it's good to rely on it less.

Reclaim work is up overall, however direct reclaim work is down. Part
of the increase can be attributed to a higher use of THPs, which due
to internal fragmentation increase the memory footprint. This is not
necessarily an unexpected side-effect for users of THP. However,
taking both points together, there may well be some opportunities for
fine tuning in the reclaim/compaction coordination.
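The balance checks themselves keep the same shape; only the supply of
"free pages" changes. An invented-numbers sketch of why this matters:
a zone can clear the order-0 high watermark while almost none of that
memory sits in whole pageblocks:

#include <stdio.h>

static int defrag_mode = 1;

static long nr_free_pages = 20000;		/* NR_FREE_PAGES, invented */
static long nr_free_pages_blocks = 4000;	/* NR_FREE_PAGES_BLOCKS, invented */

static int high_wmark_met(long mark)
{
	long free = defrag_mode ? nr_free_pages_blocks : nr_free_pages;

	return free > mark;
}

int main(void)
{
	long mark = 8000;	/* invented high watermark */

	printf("vanilla view:     %s\n",
	       nr_free_pages > mark ? "balanced" : "keep reclaiming");
	printf("defrag_mode view: %s\n",
	       high_wmark_met(mark) ? "balanced" : "keep reclaiming");
	return 0;
}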
Signed-off-by: Johannes Weiner
---
 include/linux/mmzone.h |  1 +
 mm/compaction.c        | 37 ++++++++++++++++++++++++++++++-------
 mm/internal.h          |  1 +
 mm/page_alloc.c        | 29 +++++++++++++++++++++++------
 mm/vmscan.c            | 15 ++++++++++++++-
 mm/vmstat.c            |  1 +
 6 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dbb0ad69e17f..37c29f3fbca8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -138,6 +138,7 @@ enum numa_stat_item {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
+	NR_FREE_PAGES_BLOCKS,
 	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
 	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_ACTIVE_ANON,
diff --git a/mm/compaction.c b/mm/compaction.c
index 036353ef1878..4a2ccb82d0b2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2329,6 +2329,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 	if (!pageblock_aligned(cc->migrate_pfn))
 		return COMPACT_CONTINUE;
 
+	/*
+	 * When defrag_mode is enabled, make kcompactd target
+	 * watermarks in whole pageblocks. Because they can be stolen
+	 * without polluting, no further fallback checks are needed.
+	 */
+	if (defrag_mode && !cc->direct_compaction) {
+		if (__zone_watermark_ok(cc->zone, cc->order,
+					high_wmark_pages(cc->zone),
+					cc->highest_zoneidx, cc->alloc_flags,
+					zone_page_state(cc->zone,
+							NR_FREE_PAGES_BLOCKS)))
+			return COMPACT_SUCCESS;
+
+		return COMPACT_CONTINUE;
+	}
+
 	/* Direct compactor: Is a suitable page free? */
 	ret = COMPACT_NO_SUITABLE_PAGE;
 	for (order = cc->order; order < NR_PAGE_ORDERS; order++) {
@@ -2496,13 +2512,19 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 static enum compact_result
 compaction_suit_allocation_order(struct zone *zone, unsigned int order,
 				 int highest_zoneidx, unsigned int alloc_flags,
-				 bool async)
+				 bool async, bool kcompactd)
 {
+	unsigned long free_pages;
 	unsigned long watermark;
 
+	if (kcompactd && defrag_mode)
+		free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+	else
+		free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
 	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
-	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
-			      alloc_flags))
+	if (__zone_watermark_ok(zone, order, watermark, highest_zoneidx,
+				alloc_flags, free_pages))
 		return COMPACT_SUCCESS;
 
 	/*
@@ -2558,7 +2580,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 		ret = compaction_suit_allocation_order(cc->zone, cc->order,
 						       cc->highest_zoneidx,
 						       cc->alloc_flags,
-						       cc->mode == MIGRATE_ASYNC);
+						       cc->mode == MIGRATE_ASYNC,
+						       !cc->direct_compaction);
 		if (ret != COMPACT_CONTINUE)
 			return ret;
 	}
@@ -3062,7 +3085,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
 				highest_zoneidx, ALLOC_WMARK_MIN,
-				false);
+				false, true);
 		if (ret == COMPACT_CONTINUE)
 			return true;
 	}
@@ -3085,7 +3108,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		.mode = MIGRATE_SYNC_LIGHT,
 		.ignore_skip_hint = false,
 		.gfp_mask = GFP_KERNEL,
-		.alloc_flags = ALLOC_WMARK_MIN,
+		.alloc_flags = ALLOC_WMARK_HIGH,
 	};
 	enum compact_result ret;
 
@@ -3105,7 +3128,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 				cc.order, zoneid, cc.alloc_flags,
-				false);
+				false, true);
 		if (ret != COMPACT_CONTINUE)
 			continue;
 
diff --git a/mm/internal.h b/mm/internal.h
index 2f52a65272c1..286520a424fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -536,6 +536,7 @@ extern char * const zone_names[MAX_NR_ZONES];
 DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
 
 extern int min_free_kbytes;
+extern int defrag_mode;
 
 void setup_per_zone_wmarks(void);
 void calculate_min_free_kbytes(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a0d8f871e56..c33c08e278f9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,7 +273,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
-static int defrag_mode;
+int defrag_mode;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -660,16 +660,20 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      bool tail)
 {
 	struct free_area *area = &zone->free_area[order];
+	int nr_pages = 1 << order;
 
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), migratetype, 1 << order);
+		     get_pageblock_migratetype(page), migratetype, nr_pages);
 
 	if (tail)
 		list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
 }
 
 /*
@@ -681,24 +685,34 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 				     unsigned int order, int old_mt, int new_mt)
 {
 	struct free_area *area = &zone->free_area[order];
+	int nr_pages = 1 << order;
 
 	/* Free page moving can fail, so it happens before the type update */
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), old_mt, 1 << order);
+		     get_pageblock_migratetype(page), old_mt, nr_pages);
 
 	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
 
-	account_freepages(zone, -(1 << order), old_mt);
-	account_freepages(zone, 1 << order, new_mt);
+	account_freepages(zone, -nr_pages, old_mt);
+	account_freepages(zone, nr_pages, new_mt);
+
+	if (order >= pageblock_order &&
+	    is_migrate_isolate(old_mt) != is_migrate_isolate(new_mt)) {
+		if (!is_migrate_isolate(old_mt))
+			nr_pages = -nr_pages;
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
+	}
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 					     unsigned int order, int migratetype)
 {
+	int nr_pages = 1 << order;
+
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), migratetype, 1 << order);
+		     get_pageblock_migratetype(page), migratetype, nr_pages);
 
 	/* clear reported state and update reported page count */
 	if (page_reported(page))
@@ -708,6 +722,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3370bdca6868..b5c7dfc2b189 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 	 * meet watermarks.
 	 */
 	for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
+		unsigned long free_pages;
+
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 			mark = promo_wmark_pages(zone);
 		else
 			mark = high_wmark_pages(zone);
-		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
+
+		/*
+		 * In defrag_mode, watermarks must be met in whole
+		 * blocks to avoid polluting allocator fallbacks.
+		 */
+		if (defrag_mode)
+			free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+		else
+			free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
+					0, free_pages))
 			return true;
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..ed49a86348f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1190,6 +1190,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item counters */
 	"nr_free_pages",
+	"nr_free_pages_blocks",
 	"nr_zone_inactive_anon",
 	"nr_zone_active_anon",
 	"nr_zone_inactive_file",
-- 
2.48.1