[RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot

Bharata B Rao posted 5 patches 1 week, 3 days ago
Currently, hot page promotion (the NUMA_BALANCING_MEMORY_TIERING
mode of NUMA balancing) performs hot page detection (via hint
faults), hot page classification and the eventual promotion all
by itself, and the whole mechanism sits within the scheduler.

With pghot, the new hot page tracking and promotion mechanism,
now available, NUMA balancing can limit itself to the detection
of hot pages (via hint faults) and off-load the rest of the
functionality to pghot.

To achieve this, the pghot_record_access(PGHOT_HINTFAULTS) API
is used to feed the hot page info to pghot. In addition, the
migration rate limiting and dynamic threshold logic are moved to
kmigrated so that they can be used for hot pages reported by
other sources too. Hence it becomes necessary to introduce a
new config option, CONFIG_NUMA_BALANCING_TIERING, to control
the hint fault source for hot page promotion. This option
controls the NUMA_BALANCING_MEMORY_TIERING mode of
kernel.numa_balancing.
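
The rate limiting that moves into kmigrated can be illustrated with a
minimal userspace C model (struct and function names here are
illustrative, not the kernel's): candidate pages are counted in
one-second windows, and promotion is throttled once a window exceeds
the configured page budget.

```c
#include <stdbool.h>

/*
 * Userspace sketch (hypothetical names) of the promotion rate limiter
 * that moves from NUMA balancing into kmigrated. nr_cand grows
 * monotonically; window_cand snapshots it at the start of each
 * one-second window, so (nr_cand - window_cand) is the number of
 * candidate pages seen within the current window.
 */
struct rl_state {
	unsigned long nr_cand;      /* cumulative candidate pages */
	unsigned long window_start; /* ms timestamp of current window */
	unsigned long window_cand;  /* nr_cand snapshot at window start */
};

/* Returns true when promotion should be throttled. */
static bool rate_limited(struct rl_state *st, unsigned long rate_limit,
			 unsigned long nr, unsigned long now_ms)
{
	st->nr_cand += nr;
	if (now_ms - st->window_start > 1000) {
		st->window_start = now_ms;
		st->window_cand = st->nr_cand;
	}
	return st->nr_cand - st->window_cand >= rate_limit;
}
```

With a 256-page budget, a first batch of 200 pages passes, a second
batch in the same one-second window trips the limit, and a batch in a
fresh window passes again.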

This movement of hot page promotion to pghot results in the following
changes to the behaviour of hint faults based hot page promotion:

1. Promotion is no longer done in the fault path but instead is
   deferred to kmigrated and happens in batches.
2. NUMA_BALANCING_MEMORY_TIERING mode used to promote on the first
   access. Pghot, by default, promotes on the second access, though
   this can be changed by setting /sys/kernel/debug/pghot/freq_threshold.
   The hot_threshold_ms debugfs tunable is now replaced by pghot's
   freq_threshold.
3. In NUMA_BALANCING_MEMORY_TIERING mode, hint fault latency is the
   difference between the PTE update time (during scanning) and the
   access time (hint fault). However, with pghot, a single latency
   threshold is used for two purposes:
   a) If the time difference between successive accesses is within
      the threshold, the page is marked as hot.
   b) Later, when kmigrated picks up the page for migration, it
      migrates the page only if the difference between the current
      time and the time when the page was marked hot is within the
      threshold.
4. Batch migration of misplaced folios is done from a non-process
   context where VMA info is not readily available. Without the VMA,
   and hence without the exec check on it, it is not possible to
   filter out exec pages during the migration prep stage. Hence
   shared executable pages will also be subject to misplaced
   migration.
5. The max scan period, which is used in the dynamic threshold logic,
   was a debugfs tunable. However, this has been converted to a
   fixed constant (KMIGRATED_PROMOTION_THRESHOLD_WINDOW) in pghot.
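
The dual use of a single threshold described in point 3 above can be
sketched as a small userspace C model (names are hypothetical, not the
patch's): one check classifies a page as hot, the other decides
whether the hot mark is still fresh enough to act on.

```c
#include <stdbool.h>

/*
 * Illustrative model (hypothetical names) of how one latency threshold
 * serves two purposes:
 *  a) classification: two accesses within the threshold mark the page
 *     as hot;
 *  b) migration: the migrator acts only while the hot mark is fresh,
 *     i.e. within the same threshold of the current time.
 */
struct page_track {
	unsigned long last_access_ms;
	unsigned long hot_since_ms;	/* 0 = not marked hot */
};

static void record_access(struct page_track *pt, unsigned long now_ms,
			  unsigned long threshold_ms)
{
	if (pt->last_access_ms &&
	    now_ms - pt->last_access_ms <= threshold_ms)
		pt->hot_since_ms = now_ms; /* repeat access in window: hot */
	pt->last_access_ms = now_ms;
}

static bool should_migrate(const struct page_track *pt, unsigned long now_ms,
			   unsigned long threshold_ms)
{
	return pt->hot_since_ms &&
	       now_ms - pt->hot_since_ms <= threshold_ms;
}
```

A first access only records a timestamp; a second access within the
threshold marks the page hot, and the hot mark goes stale once more
than a threshold's worth of time passes before migration.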

Key code changes due to this movement are detailed below to ease
understanding of the restructuring.

1. Scanning and access times are no longer tracked in the last_cpupid
   field of folio flags. Hence all code related to this (like
   folio_xchg_access_time(), cpupid_valid()) is removed.
2. The misplaced migration routines become conditional on
   CONFIG_PGHOT in addition to CONFIG_NUMA_BALANCING.
3. The promotion related stats (like PGPROMOTE_SUCCESS etc.) are
   now moved under CONFIG_PGHOT as these stats are part of the
   promotion engine, which will be used for other hotness sources
   as well.
4. Routines that are responsible for migration rate limiting,
   dynamic thresholding, pgdat balancing during promotion etc.
   are moved to pghot with appropriate renaming.
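
The dynamic thresholding mentioned in point 4 can be modeled in
userspace as follows. This is a simplified standalone sketch of the
adjustment step (mirroring the kernel's arithmetic, but the function
itself is hypothetical): if the candidate count over the adjustment
window overshoots the window budget by more than 10%, the hot
threshold is tightened by one step; if it undershoots by more than
10%, it is relaxed; the result stays within [one step, 2 * reference].

```c
/*
 * Standalone model of the dynamic hot-threshold adjustment moved into
 * kmigrated. cur_th == 0 means "not initialized yet, start from the
 * reference threshold", matching the ?: idiom in the kernel code.
 */
#define ADJUST_STEPS 16

static unsigned int adjust_threshold(unsigned int cur_th, unsigned int ref_th,
				     unsigned long diff_cand,
				     unsigned long ref_cand)
{
	unsigned int unit = ref_th * 2 / ADJUST_STEPS;
	unsigned int th = cur_th ? cur_th : ref_th;

	if (diff_cand > ref_cand * 11 / 10)	/* >10% over budget: tighten */
		th = th > 2 * unit ? th - unit : unit;
	else if (diff_cand < ref_cand * 9 / 10)	/* >10% under budget: relax */
		th = th + unit < ref_th * 2 ? th + unit : ref_th * 2;
	return th;
}
```

For a 1000 ms reference threshold the step is 125 ms, so a 20%
overshoot tightens 1000 to 875, while the threshold never drops below
one step or rises above 2000.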

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/mm.h     |  35 ++------
 include/linux/mmzone.h |   4 +-
 init/Kconfig           |  13 +++
 kernel/sched/core.c    |   7 ++
 kernel/sched/debug.c   |   1 -
 kernel/sched/fair.c    | 177 ++---------------------------------------
 kernel/sched/sched.h   |   1 -
 mm/huge_memory.c       |  27 ++++++-
 mm/memcontrol.c        |   6 +-
 mm/memory-tiers.c      |  15 ++--
 mm/memory.c            |  36 +++++++--
 mm/mempolicy.c         |   3 -
 mm/migrate.c           |  16 +++-
 mm/pghot.c             | 134 +++++++++++++++++++++++++++++++
 mm/vmstat.c            |   2 +-
 15 files changed, 248 insertions(+), 229 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index abb4963c1f06..81249a06dfeb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1998,17 +1998,6 @@ static inline int folio_nid(const struct folio *folio)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-/* page access time bits needs to hold at least 4 seconds */
-#define PAGE_ACCESS_TIME_MIN_BITS	12
-#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
-#define PAGE_ACCESS_TIME_BUCKETS				\
-	(PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT)
-#else
-#define PAGE_ACCESS_TIME_BUCKETS	0
-#endif
-
-#define PAGE_ACCESS_TIME_MASK				\
-	(LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS)
 
 static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
@@ -2074,15 +2063,6 @@ static inline void page_cpupid_reset_last(struct page *page)
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
-static inline int folio_xchg_access_time(struct folio *folio, int time)
-{
-	int last_time;
-
-	last_time = folio_xchg_last_cpupid(folio,
-					   time >> PAGE_ACCESS_TIME_BUCKETS);
-	return last_time << PAGE_ACCESS_TIME_BUCKETS;
-}
-
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 	unsigned int pid_bit;
@@ -2093,18 +2073,12 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 	}
 }
 
-bool folio_use_access_time(struct folio *folio);
 #else /* !CONFIG_NUMA_BALANCING */
 static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
 {
 	return folio_nid(folio); /* XXX */
 }
 
-static inline int folio_xchg_access_time(struct folio *folio, int time)
-{
-	return 0;
-}
-
 static inline int folio_last_cpupid(struct folio *folio)
 {
 	return folio_nid(folio); /* XXX */
@@ -2147,11 +2121,16 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 }
-static inline bool folio_use_access_time(struct folio *folio)
+#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_NUMA_BALANCING_TIERING
+bool folio_is_promo_candidate(struct folio *folio);
+#else
+static inline bool folio_is_promo_candidate(struct folio *folio)
 {
 	return false;
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING_TIERING */
 
 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 61fd259d9897..bfaaa757b19c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -232,7 +232,7 @@ enum node_stat_item {
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
 #endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	/**
 	 * Candidate pages for promotion based on hint fault latency.  This
@@ -1475,7 +1475,7 @@ typedef struct pglist_data {
 	struct deferred_split deferred_split_queue;
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	/* start time in ms of current promote rate limit period */
 	unsigned int nbp_rl_start;
 	/* number of promote candidate pages at start time of current rate limit period */
diff --git a/init/Kconfig b/init/Kconfig
index 444ce811ea67..56ef148487fa 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1013,6 +1013,19 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.
 
+config NUMA_BALANCING_TIERING
+	bool "NUMA balancing memory tiering promotion"
+	depends on NUMA_BALANCING && PGHOT
+	help
+	  Enable NUMA balancing mode 2 (memory tiering). This allows
+	  automatic promotion of hot pages from slower memory tiers to
+	  faster tiers using the pghot subsystem.
+
+	  This requires CONFIG_PGHOT for the hot page tracking engine.
+	  This option is required for kernel.numa_balancing=2.
+
+	  If unsure, say N.
+
 config SLAB_OBJ_EXT
 	bool
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..f8ca5dff9cad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4463,6 +4463,7 @@ void set_numabalancing_state(bool enabled)
 }
 
 #ifdef CONFIG_PROC_SYSCTL
+#ifdef CONFIG_NUMA_BALANCING_TIERING
 static void reset_memory_tiering(void)
 {
 	struct pglist_data *pgdat;
@@ -4473,6 +4474,7 @@ static void reset_memory_tiering(void)
 		pgdat->nbp_th_start = jiffies_to_msecs(jiffies);
 	}
 }
+#endif
 
 static int sysctl_numa_balancing(const struct ctl_table *table, int write,
 			  void *buffer, size_t *lenp, loff_t *ppos)
@@ -4490,9 +4492,14 @@ static int sysctl_numa_balancing(const struct ctl_table *table, int write,
 	if (err < 0)
 		return err;
 	if (write) {
+		if ((state & NUMA_BALANCING_MEMORY_TIERING) &&
+		    !IS_ENABLED(CONFIG_NUMA_BALANCING_TIERING))
+			return -EOPNOTSUPP;
+#ifdef CONFIG_NUMA_BALANCING_TIERING
 		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
 		    (state & NUMA_BALANCING_MEMORY_TIERING))
 			reset_memory_tiering();
+#endif
 		sysctl_numa_balancing_mode = state;
 		__set_numabalancing_state(state);
 	}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index b24f40f05019..c6a3325ebbd2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -622,7 +622,6 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
 	debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
 	debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
-	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..131fc4bb1fa7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu)
 static unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
-/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
-#endif
-
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table sched_fair_sysctls[] = {
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] = {
 		.extra1         = SYSCTL_ONE,
 	},
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	{
-		.procname	= "numa_balancing_promote_rate_limit_MBps",
-		.data		= &sysctl_numa_balancing_promote_rate_limit,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= SYSCTL_ZERO,
-	},
-#endif /* CONFIG_NUMA_BALANCING */
 };
 
 static int __init sched_fair_sysctl_init(void)
@@ -1519,9 +1504,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
-/* The page with hint page fault latency < threshold in ms is considered hot */
-unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
-
 struct numa_group {
 	refcount_t refcount;
 
@@ -1864,120 +1846,6 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
 	return 1000 * faults / total_faults;
 }
 
-/*
- * If memory tiering mode is enabled, cpupid of slow memory page is
- * used to record scan time instead of CPU and PID.  When tiering mode
- * is disabled at run time, the scan time (in cpupid) will be
- * interpreted as CPU and PID.  So CPU needs to be checked to avoid to
- * access out of array bound.
- */
-static inline bool cpupid_valid(int cpupid)
-{
-	return cpupid_to_cpu(cpupid) < nr_cpu_ids;
-}
-
-/*
- * For memory tiering mode, if there are enough free pages (more than
- * enough watermark defined here) in fast memory node, to take full
- * advantage of fast memory capacity, all recently accessed slow
- * memory pages will be migrated to fast memory node without
- * considering hot threshold.
- */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
-{
-	int z;
-	unsigned long enough_wmark;
-
-	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
-			   pgdat->node_present_pages >> 4);
-	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
-		struct zone *zone = pgdat->node_zones + z;
-
-		if (!populated_zone(zone))
-			continue;
-
-		if (zone_watermark_ok(zone, 0,
-				      promo_wmark_pages(zone) + enough_wmark,
-				      ZONE_MOVABLE, 0))
-			return true;
-	}
-	return false;
-}
-
-/*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page.  So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
- *
- *	hint page fault latency = hint page fault time - scan time
- *
- * The smaller the hint page fault latency, the higher the possibility
- * for the page to be hot.
- */
-static int numa_hint_fault_latency(struct folio *folio)
-{
-	int last_time, time;
-
-	time = jiffies_to_msecs(jiffies);
-	last_time = folio_xchg_access_time(folio, time);
-
-	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
-}
-
-/*
- * For memory tiering mode, too high promotion/demotion throughput may
- * hurt application latency.  So we provide a mechanism to rate limit
- * the number of pages that are tried to be promoted.
- */
-static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
-				      unsigned long rate_limit, int nr)
-{
-	unsigned long nr_cand;
-	unsigned int now, start;
-
-	now = jiffies_to_msecs(jiffies);
-	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
-	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-	start = pgdat->nbp_rl_start;
-	if (now - start > MSEC_PER_SEC &&
-	    cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
-		pgdat->nbp_rl_nr_cand = nr_cand;
-	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
-		return true;
-	return false;
-}
-
-#define NUMA_MIGRATION_ADJUST_STEPS	16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
-					    unsigned long rate_limit,
-					    unsigned int ref_th)
-{
-	unsigned int now, start, th_period, unit_th, th;
-	unsigned long nr_cand, ref_cand, diff_cand;
-
-	now = jiffies_to_msecs(jiffies);
-	th_period = sysctl_numa_balancing_scan_period_max;
-	start = pgdat->nbp_th_start;
-	if (now - start > th_period &&
-	    cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
-		ref_cand = rate_limit *
-			sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
-		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
-		unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
-		th = pgdat->nbp_threshold ? : ref_th;
-		if (diff_cand > ref_cand * 11 / 10)
-			th = max(th - unit_th, unit_th);
-		else if (diff_cand < ref_cand * 9 / 10)
-			th = min(th + unit_th, ref_th * 2);
-		pgdat->nbp_th_nr_cand = nr_cand;
-		pgdat->nbp_threshold = th;
-	}
-}
-
 bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 				int src_nid, int dst_cpu)
 {
@@ -1993,41 +1861,15 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 
 	/*
 	 * The pages in slow memory node should be migrated according
-	 * to hot/cold instead of private/shared.
-	 */
-	if (folio_use_access_time(folio)) {
-		struct pglist_data *pgdat;
-		unsigned long rate_limit;
-		unsigned int latency, th, def_th;
-		long nr = folio_nr_pages(folio);
-
-		pgdat = NODE_DATA(dst_nid);
-		if (pgdat_free_space_enough(pgdat)) {
-			/* workload changed, reset hot threshold */
-			pgdat->nbp_threshold = 0;
-			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr);
-			return true;
-		}
-
-		def_th = sysctl_numa_balancing_hot_threshold;
-		rate_limit = MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit);
-		numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
-		th = pgdat->nbp_threshold ? : def_th;
-		latency = numa_hint_fault_latency(folio);
-		if (latency >= th)
-			return false;
-
-		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
-	}
+	 * to hot/cold instead of private/shared. Also the migration
+	 * of such pages are handled by kmigrated.
+	 */
+	if (folio_is_promo_candidate(folio))
+		return true;
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
 
-	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
-	    !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid))
-		return false;
-
 	/*
 	 * Allow first faults or private faults to migrate immediately early in
 	 * the lifetime of a task. The magic number 4 is based on waiting for
@@ -3237,15 +3079,6 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	if (!p->mm)
 		return;
 
-	/*
-	 * NUMA faults statistics are unnecessary for the slow memory
-	 * node for memory tiering mode.
-	 */
-	if (!node_is_toptier(mem_node) &&
-	    (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ||
-	     !cpupid_valid(last_cpupid)))
-		return;
-
 	/* Allocate buffer to track faults on a per-node basis */
 	if (unlikely(!p->numa_faults)) {
 		int size = sizeof(*p->numa_faults) *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cca..a47f7e3d51a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3021,7 +3021,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
-extern unsigned int sysctl_numa_balancing_hot_threshold;
 
 #ifdef CONFIG_SCHED_HRTICK
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b298cba853ab..fe957ff91df9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -40,6 +40,7 @@
 #include <linux/pgalloc.h>
 #include <linux/pgalloc_tag.h>
 #include <linux/pagewalk.h>
+#include <linux/pghot.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -2190,7 +2191,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	int nid = NUMA_NO_NODE;
 	int target_nid, last_cpupid;
 	pmd_t pmd, old_pmd;
-	bool writable = false;
+	bool writable = false, needs_promotion = false;
 	int flags = 0;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -2217,11 +2218,26 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 		goto out_map;
 
 	nid = folio_nid(folio);
+	needs_promotion = folio_is_promo_candidate(folio);
 
 	target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
 					&last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
+
+	if (needs_promotion) {
+		/*
+		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
+		 * Isolation and migration are handled by pghot.
+		 *
+		 * TODO: mode2 check
+		 */
+		writable = false;
+		nid = target_nid;
+		goto out_map;
+	}
+
+	/* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
 	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
 		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
@@ -2253,8 +2269,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 	spin_unlock(vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+	if (nid != NUMA_NO_NODE) {
+		if (needs_promotion)
+			pghot_record_access(folio_pfn(folio), nid,
+					    PGHOT_HINTFAULTS, jiffies);
+		else
+			task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+	}
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..fcd92f2ffd0c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -323,7 +323,7 @@ static const unsigned int memcg_node_stat_items[] = {
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
 #endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	PGPROMOTE_SUCCESS,
 #endif
 	PGDEMOTE_KSWAPD,
@@ -1400,7 +1400,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "pgdemote_direct",		PGDEMOTE_DIRECT		},
 	{ "pgdemote_khugepaged",	PGDEMOTE_KHUGEPAGED	},
 	{ "pgdemote_proactive",		PGDEMOTE_PROACTIVE	},
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	{ "pgpromote_success",		PGPROMOTE_SUCCESS	},
 #endif
 };
@@ -1443,7 +1443,7 @@ static int memcg_page_state_output_unit(int item)
 	case PGDEMOTE_DIRECT:
 	case PGDEMOTE_KHUGEPAGED:
 	case PGDEMOTE_PROACTIVE:
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	case PGPROMOTE_SUCCESS:
 #endif
 		return 1;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 986f809376eb..7303dc10035c 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -51,18 +51,19 @@ static const struct bus_type memory_tier_subsys = {
 	.dev_name = "memory_tier",
 };
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_NUMA_BALANCING_TIERING
 /**
- * folio_use_access_time - check if a folio reuses cpupid for page access time
+ * folio_is_promo_candidate - check if the folio qualifies for promotion
+ *
  * @folio: folio to check
  *
- * folio's _last_cpupid field is repurposed by memory tiering. In memory
- * tiering mode, cpupid of slow memory folio (not toptier memory) is used to
- * record page access time.
+ * Checks if NUMA Balancing tiering mode is set and the folio belongs
+ * to lower tier. If so, it qualifies for promotion to toptier when
+ * it is categorized as hot.
  *
- * Return: the folio _last_cpupid is used to record page access time
+ * Return: True if the above condition is met, else False.
  */
-bool folio_use_access_time(struct folio *folio)
+bool folio_is_promo_candidate(struct folio *folio)
 {
 	return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
 	       !node_is_toptier(folio_nid(folio));
diff --git a/mm/memory.c b/mm/memory.c
index 2f815a34d924..289fa6c07a42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
+#include <linux/pghot.h>
 #include <linux/sched/sysctl.h>
 #include <linux/pgalloc.h>
 #include <linux/uaccess.h>
@@ -5968,10 +5969,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
 	if (folio_maybe_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
 		*flags |= TNF_SHARED;
 	/*
-	 * For memory tiering mode, cpupid of slow memory page is used
-	 * to record page access time.  So use default value.
+	 * For memory tiering mode, last_cpupid is unused. So use default value.
 	 */
-	if (folio_use_access_time(folio))
+	if (folio_is_promo_candidate(folio))
 		*last_cpupid = (-1 & LAST_CPUPID_MASK);
 	else
 		*last_cpupid = folio_last_cpupid(folio);
@@ -6052,6 +6052,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	int nid = NUMA_NO_NODE;
 	bool writable = false, ignore_writable = false;
 	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
+	bool needs_promotion = false;
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
@@ -6086,16 +6087,31 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		goto out_map;
 
 	nid = folio_nid(folio);
+	needs_promotion = folio_is_promo_candidate(folio);
 	nr_pages = folio_nr_pages(folio);
 
 	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
 					writable, &last_cpupid);
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
-	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
+
+	if (needs_promotion) {
+		/*
+		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
+		 * Isolation and migration are handled by pghot.
+		 */
+		writable = false;
+		ignore_writable = true;
+		nid = target_nid;
+		goto out_map;
+	}
+
+	/* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
+	if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
 		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
 	}
+
 	/* The folio is isolated and isolation code holds a folio reference. */
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	writable = false;
@@ -6110,7 +6126,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	}
 
 	flags |= TNF_MIGRATE_FAIL;
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
 				       vmf->address, &vmf->ptl);
 	if (unlikely(!vmf->pte))
 		return 0;
@@ -6118,6 +6134,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return 0;
 	}
+
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
@@ -6131,8 +6148,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 					    writable);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	if (nid != NUMA_NO_NODE) {
+		if (needs_promotion)
+			pghot_record_access(folio_pfn(folio), nid,
+					    PGHOT_HINTFAULTS, jiffies);
+		else
+			task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	}
 	return 0;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e5175f1c767..6eed217a5917 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -866,9 +866,6 @@ bool folio_can_map_prot_numa(struct folio *folio, struct vm_area_struct *vma,
 	    node_is_toptier(nid))
 		return false;
 
-	if (folio_use_access_time(folio))
-		folio_xchg_access_time(folio, jiffies_to_msecs(jiffies));
-
 	return true;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index a5f48984ed3e..db6832b4b95b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2690,8 +2690,18 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
 		int z;
 
-		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
+		/*
+		 * Kswapd wakeup for creating headroom in toptier is done only
+		 * for hot page promotion case and not for misplaced migrations
+		 * between toptier nodes.
+		 *
+		 * In the uncommon case of using NUMA_BALANCING_NORMAL mode
+		 * to balance between lower and higher tier nodes, we end up
+		 * up waking up kswapd.
+		 */
+		if (node_is_toptier(folio_nid(folio)))
 			return -EAGAIN;
+
 		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 			if (managed_zone(pgdat->node_zones + z))
 				break;
@@ -2741,6 +2751,8 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 #ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+#endif
+#ifdef CONFIG_NUMA_BALANCING_TIERING
 		if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 		    && !node_is_toptier(folio_nid(folio))
 		    && node_is_toptier(node)) {
@@ -2796,6 +2808,8 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
 #ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+#endif
+#ifdef CONFIG_PGHOT
 		mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
 #endif
 	}
diff --git a/mm/pghot.c b/mm/pghot.c
index 7d7ef0800ae2..3c0ba254ad4c 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -17,6 +17,9 @@
  * the hot pages. kmigrated runs for each lower tier node. It iterates
  * over the node's PFNs and  migrates pages marked for migration into
  * their targeted nodes.
+ *
+ * Migration rate-limiting and dynamic threshold logic implementations
+ * were moved from NUMA Balancing mode 2.
  */
 #include <linux/mm.h>
 #include <linux/migrate.h>
@@ -32,6 +35,12 @@ unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
 
 unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
 
+/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
+static unsigned int sysctl_pghot_promote_rate_limit = 65536;
+
+#define KMIGRATED_MIGRATION_ADJUST_STEPS	16
+#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW	60000
+
 DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
 DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
 
@@ -45,6 +54,22 @@ static const struct ctl_table pghot_sysctls[] = {
 		.proc_handler   = proc_dointvec_minmax,
 		.extra1         = SYSCTL_ZERO,
 	},
+	{
+		.procname	= "pghot_promote_rate_limit_MBps",
+		.data		= &sysctl_pghot_promote_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
+	{
+		.procname	= "numa_balancing_promote_rate_limit_MBps",
+		.data		= &sysctl_pghot_promote_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 };
 #endif
 
@@ -141,6 +166,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
 	return 0;
 }
 
+/*
+ * For memory tiering mode, if there are enough free pages (more than
+ * enough watermark defined here) in fast memory node, to take full
+ * advantage of fast memory capacity, all recently accessed slow
+ * memory pages will be migrated to fast memory node without
+ * considering hot threshold.
+ */
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+	int z;
+	unsigned long enough_wmark;
+
+	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+			   pgdat->node_present_pages >> 4);
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_watermark_ok(zone, 0,
+				      promo_wmark_pages(zone) + enough_wmark,
+				      ZONE_MOVABLE, 0))
+			return true;
+	}
+	return false;
+}
+
+/*
+ * For memory tiering mode, too high promotion/demotion throughput may
+ * hurt application latency.  So we provide a mechanism to rate limit
+ * the number of pages that are tried to be promoted.
+ */
+static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsigned long rate_limit,
+					   int nr, unsigned long now_ms)
+{
+	unsigned long nr_cand;
+	unsigned int start;
+
+	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+	start = pgdat->nbp_rl_start;
+	if (now_ms - start > MSEC_PER_SEC &&
+	    cmpxchg(&pgdat->nbp_rl_start, start, now_ms) == start)
+		pgdat->nbp_rl_nr_cand = nr_cand;
+	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
+		return true;
+	return false;
+}
+
+static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat,
+						 unsigned long rate_limit, unsigned int ref_th,
+						 unsigned long now_ms)
+{
+	unsigned int start, th_period, unit_th, th;
+	unsigned long nr_cand, ref_cand, diff_cand;
+
+	th_period = KMIGRATED_PROMOTION_THRESHOLD_WINDOW;
+	start = pgdat->nbp_th_start;
+	if (now_ms - start > th_period &&
+	    cmpxchg(&pgdat->nbp_th_start, start, now_ms) == start) {
+		ref_cand = rate_limit *
+			KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
+		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
+		unit_th = ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS;
+		th = pgdat->nbp_threshold ? : ref_th;
+		if (diff_cand > ref_cand * 11 / 10)
+			th = max(th - unit_th, unit_th);
+		else if (diff_cand < ref_cand * 9 / 10)
+			th = min(th + unit_th, ref_th * 2);
+		pgdat->nbp_th_nr_cand = nr_cand;
+		pgdat->nbp_threshold = th;
+	}
+}
+
+static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
+					    unsigned long time)
+{
+	struct pglist_data *pgdat;
+	unsigned long rate_limit;
+	unsigned int th, def_th;
+	unsigned long now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
+	unsigned long now = jiffies;
+
+	pgdat = NODE_DATA(nid);
+	if (pgdat_free_space_enough(pgdat)) {
+		/* workload changed, reset hot threshold */
+		pgdat->nbp_threshold = 0;
+		mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages);
+		return true;
+	}
+
+	def_th = sysctl_pghot_freq_window;
+	rate_limit = MB_TO_PAGES(sysctl_pghot_promote_rate_limit);
+	kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms);
+
+	th = pgdat->nbp_threshold ? : def_th;
+	if (pghot_access_latency(time, now) >= th)
+		return false;
+
+	return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
+}
+
 static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
 			     unsigned long *time)
 {
@@ -218,6 +347,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
 			goto out_next;
 		}
 
+		if (!kmigrated_should_migrate_memory(nr, nid, time)) {
+			folio_put(folio);
+			goto out_next;
+		}
+
 		if (migrate_misplaced_folio_prepare(folio, NULL, nid)) {
 			folio_put(folio);
 			goto out_next;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d3fbe2a5d0e6..f28f786f8931 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1267,7 +1267,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_SWAP
 	[I(NR_SWAPCACHE)]			= "nr_swapcached",
 #endif
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_PGHOT
 	[I(PGPROMOTE_SUCCESS)]			= "pgpromote_success",
 	[I(PGPROMOTE_CANDIDATE)]		= "pgpromote_candidate",
 	[I(PGPROMOTE_CANDIDATE_NRL)]		= "pgpromote_candidate_nrl",
-- 
2.34.1
Re: [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot
Posted by Bharata B Rao 3 days, 9 hours ago
Replies to the review from Sashiko
[https://sashiko.dev/#/patchset/20260323095104.238982-1-bharata%40amd.com]

>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index b298cba853ab..fe957ff91df9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>[ ... ]
>> @@ -2217,11 +2218,26 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>  		goto out_map;
>>
>>  	nid = folio_nid(folio);
>> +	needs_promotion = folio_is_promo_candidate(folio);
>>
>>  	target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
>>  					&last_cpupid);
>>  	if (target_nid == NUMA_NO_NODE)
>>  		goto out_map;

> Does this code inadvertently record pages forbidden from migrating in pghot?
> If numa_migrate_check() returns NUMA_NO_NODE due to memory policy, the code
> jumps to out_map. At that point, nid retains its initialized value of
> folio_nid(folio). Since nid != NUMA_NO_NODE evaluates to true, it looks like
> pghot_record_access() will still be called at the end of the function, which
> could pollute the tracking map and trigger useless daemon sweeps.

While for NUMAB1 we need to call task_numa_fault() just to record the fault
stats, the equivalent for NUMAB2 isn't necessary. I will make the call
to pghot_record_access() conditional on this scenario.

>> +
>> +	if (needs_promotion) {
>> +		/*
>> +		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> +		 * Isolation and migration are handled by pghot.
>> +		 *
>> +		 * TODO: mode2 check
>> +		 */
>> +		writable = false;

> Can this cause a regression involving spurious write-protect faults? By forcing
> writable = false here before jumping to out_map, the write permission appears
> to be stripped from the restored mapping even if the page was previously
> writable. If the application access was a write, it might immediately suffer a
> spurious write-protect fault when it retries the instruction.
>> +		nid = target_nid;
>> +		goto out_map;
>> +	}

Since there is no migration from the fault context for NUMAB2, we don't
release the PTL at all. Hence I realize I don't have to modify
writable and ignore_writable for the hot page promotion case.

>[ ... ]
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2f815a34d924..289fa6c07a42 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>[ ... ]
>> @@ -6086,16 +6087,31 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>  		goto out_map;
>>
>>  	nid = folio_nid(folio);
>> +	needs_promotion = folio_is_promo_candidate(folio);
>>  	nr_pages = folio_nr_pages(folio);
>>
>>  	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
>>  					writable, &last_cpupid);
>>  	if (target_nid == NUMA_NO_NODE)
>>  		goto out_map;

> Similarly to do_huge_pmd_numa_page(), could this also inadvertently record
> pages forbidden from migrating when jumping to out_map with an unmodified
> nid?

Same as above.

>> -	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +
>> +	if (needs_promotion) {
>> +		/*
>> +		 * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> +		 * Isolation and migration are handled by pghot.
>> +		 */
>> +		writable = false;
>> +		ignore_writable = true;

> Would forcing writable = false and ignore_writable = true here also strip
> permissions and cause a performance regression from spurious write-protect
> faults for previously writable pages?

Same as above.

>> +		nid = target_nid;
>> +		goto out_map;
>> +	}
>> +
>> +	/* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
>> +	if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
>>  		flags |= TNF_MIGRATE_FAIL;
>>  		goto out_map;
>>  	}
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a5f48984ed3e..db6832b4b95b 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>[ ... ]
>> @@ -2796,6 +2808,8 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
>>  #ifdef CONFIG_NUMA_BALANCING
>>  		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
>>  		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
>> +#endif
>> +#ifdef CONFIG_PGHOT
>> +		mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
>>  #endif
>>  	}

> Should this use mod_lruvec_state() instead of mod_node_page_state() to
> accurately update both the node and memcg counters? PGPROMOTE_SUCCESS is
> defined in memcg_node_stat_items, so it is exported to the cgroup
> memory.stat interface. Using mod_node_page_state() bypasses memcg accounting,
> which might cause the cgroup-level promotion statistics to silently underflow
> or remain at zero.

Replied to this in one of the previous threads.

>> diff --git a/mm/pghot.c b/mm/pghot.c
>> index 7d7ef0800ae2..3c0ba254ad4c 100644
>> --- a/mm/pghot.c
>> +++ b/mm/pghot.c
>[ ... ]
>> @@ -141,6 +166,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>>  	return 0;
>>  }
>[ ... ]
>> +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
>> +					    unsigned long time)
>> +{
>> +	struct pglist_data *pgdat;
>> +	unsigned long rate_limit;
>> +	unsigned int th, def_th;
>> +	unsigned long now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
>> +	unsigned long now = jiffies;

> Could this cause spurious threshold adjustments during millisecond
> wrap-arounds? jiffies_to_msecs() returns a 32-bit unsigned int which is
> zero-extended into the 64-bit unsigned long now_ms. When computing time
> differences like now_ms - start (where start is an unsigned int) in
> kmigrated_promotion_adjust_threshold(), the subtraction is performed in 64-bit arithmetic.
> When the 32-bit counter wraps around, the subtraction could yield a very large
> value that falsely satisfies conditions like > MSEC_PER_SEC or > th_period,
> which could cause a regression in rate-limiting.

It could wrap around, but since cmpxchg() updates pgdat->nbp_rl_start to the
post-wrap value, it should be self-correcting. However, I can change now_ms
to unsigned int to make this robust.

>> +
>> +	pgdat = NODE_DATA(nid);
>> +	if (pgdat_free_space_enough(pgdat)) {
>[ ... ]
>> +
>> +	return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
>> +}
>[ ... ]
>> @@ -218,6 +347,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
>>  			goto out_next;
>>  		}
>>
>> +		if (!kmigrated_should_migrate_memory(nr, nid, time)) {
>> +			folio_put(folio);
>> +			goto out_next;
>> +		}
>> +

> Does this correctly advance the PFN when encountering large folio tail pages?
> Looking at the rest of kmigrated_walk_zone(), the loop iterator pfn is
> advanced by nr = folio_nr_pages(folio) at the out_next label.
> If the loop lands on a tail page of a large folio (for example, if a
> previous iteration failed a check and incremented by 1), folio_nr_pages()
> returns the size of the entire large folio. Adding the full folio size to a
> tail page's PFN overshoots the end of the folio, potentially skipping valid
> pages of subsequent allocations.
> Would it be safer to advance by
> folio_nr_pages(folio) - folio_page_idx(folio, page)?

We could end up on tail pages, leading to some folios being skipped, but I
think they will be reached in the next pass. Anyway, I will check whether your
suggestion can be incorporated without any additional overhead.

Regards,
Bharata.