From: Bharata B Rao
Subject: [RFC PATCH v5 05/10] mm: sched: move NUMA balancing tiering promotion to pghot
Date: Thu, 29 Jan 2026 20:10:38 +0530
Message-ID: <20260129144043.231636-6-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260129144043.231636-1-bharata@amd.com>
References: <20260129144043.231636-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Currently, hot page promotion (the NUMA_BALANCING_MEMORY_TIERING mode of
NUMA Balancing) does hot page detection (via hint faults), hot page
classification and the eventual promotion all by itself, and that logic
sits within the scheduler.

With pghot, the new hot page tracking and promotion mechanism, being
available, NUMA Balancing can limit itself to detecting hot pages (via
hint faults) and off-load the rest of the functionality to the common
hot page tracking system. The pghot_record_access(PGHOT_HINT_FAULT) API
is used to feed the hot page information to pghot.

In addition, the migration rate limiting and dynamic threshold logic are
moved to kmigrated so that they can also be used for hot pages reported
by other sources.
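For reference, the hint fault paths now feed the access into pghot with
the following call pattern (a condensed sketch of the two call sites in
the hunks below; nr_pages is HPAGE_PMD_NR in the PMD path):

	/* nid is the target node chosen by numa_migrate_check() */
	if (nid != NUMA_NO_NODE) {
		/* report the hint-fault access so pghot/kmigrated can classify and promote */
		pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies);
		task_numa_fault(last_cpupid, nid, nr_pages, flags);
	}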
Signed-off-by: Bharata B Rao
---
 kernel/sched/debug.c |   1 -
 kernel/sched/fair.c  | 152 ++-----------------------------------------
 mm/huge_memory.c     |  26 ++------
 mm/memory.c          |  31 ++-------
 mm/pghot.c           | 124 +++++++++++++++++++++++++++++++++++
 5 files changed, 141 insertions(+), 193 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..02931902a9c6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -520,7 +520,6 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
 	debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
 	debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
-	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..4e70f58fbbfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu)
 static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
-/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
-#endif
-
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table sched_fair_sysctls[] = {
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] = {
 		.extra1		= SYSCTL_ONE,
 	},
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	{
-		.procname	= "numa_balancing_promote_rate_limit_MBps",
-		.data		= &sysctl_numa_balancing_promote_rate_limit,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= SYSCTL_ZERO,
-	},
-#endif /* CONFIG_NUMA_BALANCING */
 };
 
 static int __init sched_fair_sysctl_init(void)
@@ -1427,9 +1412,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
-/* The page with hint page fault latency < threshold in ms is considered hot */
-unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
-
 struct numa_group {
 	refcount_t refcount;
 
@@ -1784,108 +1766,6 @@ static inline bool cpupid_valid(int cpupid)
 	return cpupid_to_cpu(cpupid) < nr_cpu_ids;
 }
 
-/*
- * For memory tiering mode, if there are enough free pages (more than
- * enough watermark defined here) in fast memory node, to take full
- * advantage of fast memory capacity, all recently accessed slow
- * memory pages will be migrated to fast memory node without
- * considering hot threshold.
- */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
-{
-	int z;
-	unsigned long enough_wmark;
-
-	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
-			   pgdat->node_present_pages >> 4);
-	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
-		struct zone *zone = pgdat->node_zones + z;
-
-		if (!populated_zone(zone))
-			continue;
-
-		if (zone_watermark_ok(zone, 0,
-				      promo_wmark_pages(zone) + enough_wmark,
-				      ZONE_MOVABLE, 0))
-			return true;
-	}
-	return false;
-}
-
-/*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page. So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
- *
- *	hint page fault latency = hint page fault time - scan time
- *
- * The smaller the hint page fault latency, the higher the possibility
- * for the page to be hot.
- */
-static int numa_hint_fault_latency(struct folio *folio)
-{
-	int last_time, time;
-
-	time = jiffies_to_msecs(jiffies);
-	last_time = folio_xchg_access_time(folio, time);
-
-	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
-}
-
-/*
- * For memory tiering mode, too high promotion/demotion throughput may
- * hurt application latency. So we provide a mechanism to rate limit
- * the number of pages that are tried to be promoted.
- */
-static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
-				      unsigned long rate_limit, int nr)
-{
-	unsigned long nr_cand;
-	unsigned int now, start;
-
-	now = jiffies_to_msecs(jiffies);
-	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
-	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-	start = pgdat->nbp_rl_start;
-	if (now - start > MSEC_PER_SEC &&
-	    cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
-		pgdat->nbp_rl_nr_cand = nr_cand;
-	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
-		return true;
-	return false;
-}
-
-#define NUMA_MIGRATION_ADJUST_STEPS	16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
-					    unsigned long rate_limit,
-					    unsigned int ref_th)
-{
-	unsigned int now, start, th_period, unit_th, th;
-	unsigned long nr_cand, ref_cand, diff_cand;
-
-	now = jiffies_to_msecs(jiffies);
-	th_period = sysctl_numa_balancing_scan_period_max;
-	start = pgdat->nbp_th_start;
-	if (now - start > th_period &&
-	    cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
-		ref_cand = rate_limit *
-			sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
-		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
-		unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
-		th = pgdat->nbp_threshold ? : ref_th;
-		if (diff_cand > ref_cand * 11 / 10)
-			th = max(th - unit_th, unit_th);
-		else if (diff_cand < ref_cand * 9 / 10)
-			th = min(th + unit_th, ref_th * 2);
-		pgdat->nbp_th_nr_cand = nr_cand;
-		pgdat->nbp_threshold = th;
-	}
-}
-
 bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 				int src_nid, int dst_cpu)
 {
@@ -1901,33 +1781,11 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 
 	/*
 	 * The pages in slow memory node should be migrated according
-	 * to hot/cold instead of private/shared.
-	 */
-	if (folio_use_access_time(folio)) {
-		struct pglist_data *pgdat;
-		unsigned long rate_limit;
-		unsigned int latency, th, def_th;
-		long nr = folio_nr_pages(folio);
-
-		pgdat = NODE_DATA(dst_nid);
-		if (pgdat_free_space_enough(pgdat)) {
-			/* workload changed, reset hot threshold */
-			pgdat->nbp_threshold = 0;
-			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr);
-			return true;
-		}
-
-		def_th = sysctl_numa_balancing_hot_threshold;
-		rate_limit = MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit);
-		numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
-		th = pgdat->nbp_threshold ? : def_th;
-		latency = numa_hint_fault_latency(folio);
-		if (latency >= th)
-			return false;
-
-		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
-	}
+	 * to hot/cold instead of private/shared. Also the migration
+	 * of such pages are handled by kmigrated.
+	 */
+	if (folio_use_access_time(folio))
+		return true;
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21..f52587e70b3c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include "internal.h"
@@ -2217,29 +2218,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 
 	target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
 					&last_cpupid);
+	nid = target_nid;
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
-	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
-		flags |= TNF_MIGRATE_FAIL;
-		goto out_map;
-	}
-	/* The folio is isolated and isolation code holds a folio reference. */
-	spin_unlock(vmf->ptl);
-	writable = false;
 
-	if (!migrate_misplaced_folio(folio, target_nid)) {
-		flags |= TNF_MIGRATED;
-		nid = target_nid;
-		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
-		return 0;
-	}
+	writable = false;
 
-	flags |= TNF_MIGRATE_FAIL;
-	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-	if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) {
-		spin_unlock(vmf->ptl);
-		return 0;
-	}
 out_map:
 	/* Restore the PMD */
 	pmd = pmd_modify(pmdp_get(vmf->pmd), vma->vm_page_prot);
@@ -2250,8 +2234,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 	spin_unlock(vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
+	if (nid != NUMA_NO_NODE) {
+		pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies);
 		task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+	}
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..98a9a3b675a0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -6046,34 +6047,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
 					writable, &last_cpupid);
+	nid = target_nid;
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
-	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
-		flags |= TNF_MIGRATE_FAIL;
-		goto out_map;
-	}
-	/* The folio is isolated and isolation code holds a folio reference. */
-	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
 	writable = false;
 	ignore_writable = true;
-
-	/* Migrate to the requested node */
-	if (!migrate_misplaced_folio(folio, target_nid)) {
-		nid = target_nid;
-		flags |= TNF_MIGRATED;
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
-		return 0;
-	}
-
-	flags |= TNF_MIGRATE_FAIL;
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				       vmf->address, &vmf->ptl);
-	if (unlikely(!vmf->pte))
-		return 0;
-	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
@@ -6087,8 +6066,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 			       writable);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
+	if (nid != NUMA_NO_NODE) {
+		pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies);
 		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	}
 	return 0;
 }
 
diff --git a/mm/pghot.c b/mm/pghot.c
index bf1d9029cbaa..6fc76c1eaff8 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -17,6 +17,9 @@
  * the hot pages. kmigrated runs for each lower tier node. It iterates
  * over the node's PFNs and migrates pages marked for migration into
  * their targeted nodes.
+ *
+ * Migration rate-limiting and dynamic threshold logic implementations
+ * were moved from NUMA Balancing mode 2.
  */
 #include
 #include
@@ -31,6 +34,12 @@ unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
 
 unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
 
+/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
+static unsigned int sysctl_pghot_promote_rate_limit = 65536;
+
+#define KMIGRATED_MIGRATION_ADJUST_STEPS	16
+#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW	60000
+
 DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
 DEFINE_STATIC_KEY_FALSE(pghot_src_pgtscans);
 DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
@@ -45,6 +54,14 @@ static const struct ctl_table pghot_sysctls[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "pghot_promote_rate_limit_MBps",
+		.data		= &sysctl_pghot_promote_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 };
 #endif
 
@@ -138,6 +155,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
 	return 0;
 }
 
+/*
+ * For memory tiering mode, if there are enough free pages (more than
+ * enough watermark defined here) in fast memory node, to take full
+ * advantage of fast memory capacity, all recently accessed slow
+ * memory pages will be migrated to fast memory node without
+ * considering hot threshold.
+ */
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+	int z;
+	unsigned long enough_wmark;
+
+	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+			   pgdat->node_present_pages >> 4);
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_watermark_ok(zone, 0,
+				      promo_wmark_pages(zone) + enough_wmark,
+				      ZONE_MOVABLE, 0))
+			return true;
+	}
+	return false;
+}
+
+/*
+ * For memory tiering mode, too high promotion/demotion throughput may
+ * hurt application latency. So we provide a mechanism to rate limit
+ * the number of pages that are tried to be promoted.
+ */
+static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsigned long rate_limit,
+					   int nr, unsigned long now_ms)
+{
+	unsigned long nr_cand;
+	unsigned int start;
+
+	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+	start = pgdat->nbp_rl_start;
+	if (now_ms - start > MSEC_PER_SEC &&
+	    cmpxchg(&pgdat->nbp_rl_start, start, now_ms) == start)
+		pgdat->nbp_rl_nr_cand = nr_cand;
+	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
+		return true;
+	return false;
+}
+
+static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat,
+						 unsigned long rate_limit, unsigned int ref_th,
+						 unsigned long now_ms)
+{
+	unsigned int start, th_period, unit_th, th;
+	unsigned long nr_cand, ref_cand, diff_cand;
+
+	th_period = KMIGRATED_PROMOTION_THRESHOLD_WINDOW;
+	start = pgdat->nbp_th_start;
+	if (now_ms - start > th_period &&
+	    cmpxchg(&pgdat->nbp_th_start, start, now_ms) == start) {
+		ref_cand = rate_limit *
+			KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
+		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
+		unit_th = ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS;
+		th = pgdat->nbp_threshold ? : ref_th;
+		if (diff_cand > ref_cand * 11 / 10)
+			th = max(th - unit_th, unit_th);
+		else if (diff_cand < ref_cand * 9 / 10)
+			th = min(th + unit_th, ref_th * 2);
+		pgdat->nbp_th_nr_cand = nr_cand;
+		pgdat->nbp_threshold = th;
+	}
+}
+
+static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
+					    unsigned long time)
+{
+	struct pglist_data *pgdat;
+	unsigned long rate_limit;
+	unsigned int th, def_th;
+	unsigned long now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
+	unsigned long now = jiffies;
+
+	pgdat = NODE_DATA(nid);
+	if (pgdat_free_space_enough(pgdat)) {
+		/* workload changed, reset hot threshold */
+		pgdat->nbp_threshold = 0;
+		mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages);
+		return true;
+	}
+
+	def_th = sysctl_pghot_freq_window;
+	rate_limit = MB_TO_PAGES(sysctl_pghot_promote_rate_limit);
+	kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms);
+
+	th = pgdat->nbp_threshold ? : def_th;
+	if (pghot_access_latency(time, now) >= th)
+		return false;
+
+	return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
+}
+
 static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
 			     unsigned long *time)
 {
@@ -197,6 +318,9 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
 		if (folio_nid(folio) == nid)
 			goto out_next;
 
+		if (!kmigrated_should_migrate_memory(nr, nid, time))
+			goto out_next;
+
 		if (migrate_misplaced_folio_prepare(folio, NULL, nid))
 			goto out_next;
 
-- 
2.34.1