From: Bharata B Rao
Subject: [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot
Date: Mon, 23 Mar 2026 15:21:02 +0530
Message-ID: <20260323095104.238982-4-bharata@amd.com>
In-Reply-To: <20260323095104.238982-1-bharata@amd.com>
References: <20260323095104.238982-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

pghot is a subsystem that collects memory access information from
multiple sources, classifies hot pages resident in lower-tier memory,
and promotes them to faster tiers. It stores per-PFN hotness metadata
and performs asynchronous, batched promotion via a per-lower-tier-node
kernel thread (kmigrated).

This change introduces the default (compact) mode of pghot:

- Per-PFN hotness record (phi_t = u8) embedded via mem_section:
  - 2 bits: access frequency (4 levels)
  - 5 bits: time bucket (≈4s window with HZ=1000, bucketed jiffies)
  - 1 bit : migration-ready flag (MSB)
  The LSB of mem_section->hot_map pointer is used as a per-section
  "hot" flag to gate scanning.

- Event recording API:
  int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
    @pfn: The PFN of the memory accessed
    @nid: The accessing NUMA node ID
    @src: The temperature source (subsystem) that generated the access info
    @now: The access time in jiffies
  - Sources (e.g., NUMA hint faults, HW hints) call this to report
    accesses.
  - In default mode, the nid is not stored/used for targeting;
    promotion goes to a configurable toptier node (pghot_target_nid).

- Promotion engine:
  - One kmigrated thread per lower-tier node.
  - Scans only sections whose "hot" flag was raised, iterates PFNs,
    and batches candidates by destination node.
  - Uses migrate_misplaced_folios_batch() to move batched folios.

- Tunables & stats:
  - debugfs: enabled_sources, target_nid, freq_threshold,
    kmigrated_sleep_ms, kmigrated_batch_nr
  - sysctl : vm.pghot_promote_freq_window_ms
  - vmstat : pghot_recorded_accesses, pghot_recorded_hintfaults,
    pghot_recorded_hwhints

Memory overhead
---------------
Default mode uses 1 byte of hotness metadata per PFN on lower-tier nodes.

Behavior & policy
-----------------
- Default mode promotion target:
  The nid passed by sources is not stored; hot pages promote to
  pghot_target_nid (toptier). Precision mode (added later in the series)
  changes this.

- Record consumption:
  kmigrated consumes (clears) the "migration-ready" bit before attempting
  isolation. If isolation/migration fails, the folio is not re-queued
  automatically; subsequent accesses will re-arm it. This avoids retry
  storms and keeps batching stable.

- Wakeups:
  kmigrated wakeups are intentionally timeout-driven in v6. We set the
  per-pgdat "activate" flag on access, and kmigrated checks this flag on
  its next sleep interval. This keeps the first cut simple and avoids
  potential wake storms; active wakeups can be considered in a follow-up.

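To make the 1-byte record layout concrete, here is a minimal userspace
sketch of the update and consume steps described above. It is
illustrative only and is not code from this patch: the sketch_* names
are hypothetical, the window is passed in jiffies rather than
milliseconds (the patch converts with jiffies_to_msecs()), and a plain
store stands in for the try_cmpxchg() loop used in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field layout as defined in the patch's include/linux/pghot.h. */
#define PGHOT_FREQ_WIDTH		2
#define PGHOT_TIME_WIDTH		5
#define PGHOT_FREQ_SHIFT		0
#define PGHOT_TIME_SHIFT		(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
#define PGHOT_TIME_BUCKETS_SHIFT	7	/* one bucket = 128 jiffies */
#define PGHOT_MIGRATE_READY		7	/* MSB of the record */
#define PGHOT_FREQ_MASK			((1u << PGHOT_FREQ_WIDTH) - 1)
#define PGHOT_TIME_MASK			((1u << PGHOT_TIME_WIDTH) - 1)
#define PGHOT_FREQ_MAX			PGHOT_FREQ_MASK

typedef uint8_t phi_t;

/* Hypothetical helper: jiffies elapsed since the stored bucket, modulo the 5-bit range. */
static unsigned long sketch_elapsed(unsigned int old_bucket, unsigned long now)
{
	unsigned long mask = (unsigned long)PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_SHIFT;
	unsigned long old_time = (unsigned long)old_bucket << PGHOT_TIME_BUCKETS_SHIFT;

	return ((now & mask) - old_time) & mask;
}

/*
 * Hypothetical, single-threaded stand-in for pghot_update_record().
 * Returns true once the record carries the migration-ready bit.
 */
static bool sketch_update(phi_t *phi, unsigned long now,
			  unsigned long window, unsigned int threshold)
{
	unsigned int hotness = *phi;
	unsigned int old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
	unsigned int old_bucket = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
	unsigned int bucket = (now >> PGHOT_TIME_BUCKETS_SHIFT) & PGHOT_TIME_MASK;
	unsigned int freq;

	assert(threshold >= 1 && threshold <= PGHOT_FREQ_MAX);

	if (sketch_elapsed(old_bucket, now) > window)
		freq = 1;			/* window expired: restart counting */
	else if (old_freq < PGHOT_FREQ_MAX)
		freq = old_freq + 1;
	else
		freq = old_freq;		/* saturate at PGHOT_FREQ_MAX */

	/* Replace the frequency and time fields; an already-set ready bit is kept. */
	hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
	hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
	hotness |= freq << PGHOT_FREQ_SHIFT;
	hotness |= bucket << PGHOT_TIME_SHIFT;
	if (freq >= threshold)
		hotness |= 1u << PGHOT_MIGRATE_READY;

	*phi = (phi_t)hotness;
	return (hotness & (1u << PGHOT_MIGRATE_READY)) != 0;
}

/* Hypothetical stand-in for pghot_get_record(): consume a ready record. */
static bool sketch_consume(phi_t *phi, unsigned int *freq, unsigned int *bucket)
{
	phi_t old = *phi;

	if (!(old & (1u << PGHOT_MIGRATE_READY)))
		return false;

	*phi = 0;	/* the whole record is reset, as pghot_get_record() does */
	*freq = (old >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
	*bucket = (old >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
	return true;
}
```

With a threshold of 2 and a 3000-jiffy window, a first call
sketch_update(&phi, 1000, 3000, 2) returns false, and a second call at
now = 1200 returns true because both accesses fall inside the window.
A later sketch_consume() then returns the frequency and time bucket and
resets the record, so the page must be accessed again to be re-armed.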
Signed-off-by: Bharata B Rao
---
 Documentation/admin-guide/mm/pghot.txt |  80 +++++
 include/linux/migrate.h                |   4 +-
 include/linux/mmzone.h                 |  20 ++
 include/linux/pghot.h                  |  82 +++++
 include/linux/vm_event_item.h          |   5 +
 mm/Kconfig                             |  14 +
 mm/Makefile                            |   1 +
 mm/migrate.c                           |  19 +-
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  79 ++++
 mm/pghot-tunables.c                    | 182 ++++++++++
 mm/pghot.c                             | 479 +++++++++++++++++++++++++
 mm/vmstat.c                            |   5 +
 13 files changed, 971 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.txt
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

diff --git a/Documentation/admin-guide/mm/pghot.txt b/Documentation/admin-guide/mm/pghot.txt
new file mode 100644
index 000000000000..5f51dd1d4d45
--- /dev/null
+++ b/Documentation/admin-guide/mm/pghot.txt
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+PGHOT: Hot Page Tracking Tunables
+=================================
+
+Overview
+========
+The PGHOT subsystem tracks frequently accessed pages in lower-tier memory and
+promotes them to faster tiers. It uses per-PFN hotness metadata and asynchronous
+migration via per-node kernel threads (kmigrated).
+
+This document describes tunables available via **debugfs** and **sysctl** for
+PGHOT.
+
+Debugfs Interface
+=================
+Path: /sys/kernel/debug/pghot/
+
+1. **enabled_sources**
+   - Bitmask to enable/disable hotness sources.
+   - Bits:
+     - 0: Hint faults (value 0x1)
+     - 1: Hardware hints (value 0x2)
+   - Default: 0 (disabled)
+   - Example:
+     # echo 0x3 > /sys/kernel/debug/pghot/enabled_sources
+     Enables all sources.
+
+2. **target_nid**
+   - Toptier NUMA node ID to which hot pages should be promoted when source
+     does not provide nid. Used when hotness source can't provide accessing
+     NID or when the tracking mode is default.
+   - Default: 0
+   - Example:
+     # echo 1 > /sys/kernel/debug/pghot/target_nid
+
+3. **freq_threshold**
+   - Minimum access frequency before a page is marked ready for promotion.
+   - Range: 1 to 3
+   - Default: 2
+   - Example:
+     # echo 3 > /sys/kernel/debug/pghot/freq_threshold
+
+4. **kmigrated_sleep_ms**
+   - Sleep interval (ms) for kmigrated thread between scans.
+   - Default: 100
+
+5. **kmigrated_batch_nr**
+   - Maximum number of folios migrated in one batch.
+   - Default: 512
+
+Sysctl Interface
+================
+1. pghot_promote_freq_window_ms
+
+Path: /proc/sys/vm/pghot_promote_freq_window_ms
+
+- Controls the time window (in ms) for counting access frequency. A page is
+  considered hot only when **freq_threshold** number of accesses occur within
+  this time period.
+- Default: 3000 (3 seconds)
+- Example:
+  # sysctl vm.pghot_promote_freq_window_ms=3000
+
+Vmstat Counters
+===============
+The following vmstat counters provide statistics about the pghot subsystem.
+
+Path: /proc/vmstat
+
+1. **pghot_recorded_accesses**
+   - Number of total hot page accesses recorded by pghot.
+
+2. **pghot_recorded_hintfaults**
+   - Number of recorded accesses reported by NUMA Balancing based
+     hotness source.
+
+3. **pghot_recorded_hwhints**
+   - Number of recorded accesses reported by hwhints source.
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 5c1e2691cec2..7f912b6ebf02 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -107,7 +107,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
 
 #endif /* CONFIG_MIGRATION */
 
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
@@ -127,7 +127,7 @@ static inline int migrate_misplaced_folios_batch(struct list_head *folio_list,
 {
 	return -EAGAIN; /* can't migrate now */
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
 
 #ifdef CONFIG_MIGRATION
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..d7ed60956543 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1064,6 +1064,7 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_KMIGRATED_ACTIVATE,	/* activates kmigrated */
 };
 
 enum zone_flags {
@@ -1518,6 +1519,10 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+#ifdef CONFIG_PGHOT
+	struct task_struct *kmigrated;
+	wait_queue_head_t kmigrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -1930,12 +1935,27 @@ struct mem_section {
 	unsigned long section_mem_map;
 
 	struct mem_section_usage *usage;
+#ifdef CONFIG_PGHOT
+	/*
+	 * Per-PFN hotness data for this section.
+	 * Array of phi_t (u8 in default mode).
+	 * LSB is used as PGHOT_SECTION_HOT_BIT flag.
+	 */
+	void *hot_map;
+#endif
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
 	 * section. (see page_ext.h about this.)
 	 */
 	struct page_ext *page_ext;
+#endif
+	/*
+	 * Padding to maintain consistent mem_section size when exactly
+	 * one of PGHOT or PAGE_EXTENSION is enabled. This ensures
+	 * optimal alignment regardless of configuration.
+	 */
+#if (defined(CONFIG_PGHOT) ^ defined(CONFIG_PAGE_EXTENSION))
 	unsigned long pad;
 #endif
 	/*
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..525d4dd28fc1
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGHOT_H
+#define _LINUX_PGHOT_H
+
+/* Page hotness temperature sources */
+enum pghot_src {
+	PGHOT_HINTFAULTS = 0,
+	PGHOT_HWHINTS,
+	PGHOT_SRC_MAX
+};
+
+#ifdef CONFIG_PGHOT
+#include
+
+extern unsigned int pghot_target_nid;
+extern unsigned int pghot_src_enabled;
+extern unsigned int pghot_freq_threshold;
+extern unsigned int kmigrated_sleep_ms;
+extern unsigned int kmigrated_batch_nr;
+extern unsigned int sysctl_pghot_freq_window;
+
+void pghot_debug_init(void);
+
+DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
+
+#define PGHOT_HINTFAULTS_ENABLED	BIT(PGHOT_HINTFAULTS)
+#define PGHOT_HWHINTS_ENABLED		BIT(PGHOT_HWHINTS)
+#define PGHOT_SRC_ENABLED_MASK		GENMASK(PGHOT_SRC_MAX - 1, 0)
+
+#define PGHOT_DEFAULT_FREQ_THRESHOLD	2
+
+#define KMIGRATED_DEFAULT_SLEEP_MS	100
+#define KMIGRATED_DEFAULT_BATCH_NR	512
+
+#define PGHOT_DEFAULT_NODE		0
+
+#define PGHOT_DEFAULT_FREQ_WINDOW	(3 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-6 are used to store frequency and time.
+ * Bit 7 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY		7
+
+#define PGHOT_FREQ_WIDTH		2
+/* Bucketed time is stored in 5 bits which can represent up to 3.9s with HZ=1000 */
+#define PGHOT_TIME_BUCKETS_SHIFT	7
+#define PGHOT_TIME_WIDTH		5
+#define PGHOT_NID_WIDTH			10
+
+#define PGHOT_FREQ_SHIFT		0
+#define PGHOT_TIME_SHIFT		(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_FREQ_MASK			GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK			GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+#define PGHOT_TIME_BUCKETS_MASK		(PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_SHIFT)
+
+#define PGHOT_NID_MAX			((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX			((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX			((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u8 phi_t;
+
+#define PGHOT_RECORD_SIZE		sizeof(phi_t)
+
+#define PGHOT_SECTION_HOT_BIT		0
+#define PGHOT_SECTION_HOT_MASK		BIT(PGHOT_SECTION_HOT_BIT)
+
+bool pghot_nid_valid(int nid);
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time);
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now);
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time);
+
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+	return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_PGHOT_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 22a139f82d75..4ce670c1bb02 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -188,6 +188,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSTACK_REST,
 #endif
 #endif /* CONFIG_DEBUG_STACK_USAGE */
+#ifdef CONFIG_PGHOT
+		PGHOT_RECORDED_ACCESSES,
+		PGHOT_RECORDED_HINTFAULTS,
+		PGHOT_RECORDED_HWHINTS,
+#endif /* CONFIG_PGHOT */
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index ebd8ea353687..4aeab6aee535 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1471,6 +1471,20 @@ config LAZY_MMU_MODE_KUNIT_TEST
 
 	  If unsure, say N.
 
+config PGHOT
+	bool "Hot page tracking and promotion"
+	def_bool n
+	depends on NUMA && MIGRATION && SPARSEMEM && MMU
+	help
+	  A sub-system to track page accesses in lower tier memory and
+	  maintain hot page information. Promotes hot pages from lower
+	  tiers to top tier by using the memory access information provided
+	  by various sources. Asynchronous promotion is done by per-node
+	  kernel threads.
+
+	  This adds 1 byte of metadata overhead per page in lower-tier
+	  memory nodes.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..33014de43acc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,3 +150,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
diff --git a/mm/migrate.c b/mm/migrate.c
index 94daec0f49ef..a5f48984ed3e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2606,7 +2606,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
 	return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags);
 }
 
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
 /*
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which is crude.
@@ -2726,12 +2726,10 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
  */
 int migrate_misplaced_folio(struct folio *folio, int node)
 {
-	pg_data_t *pgdat = NODE_DATA(node);
 	int nr_remaining;
 	unsigned int nr_succeeded;
 	LIST_HEAD(migratepages);
 	struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
-	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	list_add(&folio->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
@@ -2740,12 +2738,18 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	if (nr_remaining && !list_empty(&migratepages))
 		putback_movable_pages(&migratepages);
 	if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
 		if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 		    && !node_is_toptier(folio_nid(folio))
-		    && node_is_toptier(node))
+		    && node_is_toptier(node)) {
+			pg_data_t *pgdat = NODE_DATA(node);
+			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
 			mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded);
+		}
+#endif
 	}
 	mem_cgroup_put(memcg);
 	BUG_ON(!list_empty(&migratepages));
@@ -2773,7 +2777,6 @@ int migrate_misplaced_folio(struct folio *folio, int node)
  */
 int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
 {
-	pg_data_t *pgdat = NODE_DATA(node);
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_succeeded = 0;
 	int nr_remaining;
@@ -2790,14 +2793,16 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
 		putback_movable_pages(folio_list);
 
 	if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
-		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
+#endif
 	}
 
 	mem_cgroup_put(memcg);
 	WARN_ON(!list_empty(folio_list));
 	return nr_remaining ? -EAGAIN : 0;
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
 #endif /* CONFIG_NUMA */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index df34797691bd..c777c54cfe69 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1398,6 +1398,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 #endif
 
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kmigrated(struct pglist_data *pgdat)
+{
+	init_waitqueue_head(&pgdat->kmigrated_wait);
+}
+#else
+static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {}
+#endif
+
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
 	int i;
@@ -1407,6 +1416,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
+	pgdat_init_kmigrated(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/pghot-default.c b/mm/pghot-default.c
new file mode 100644
index 000000000000..e610062345e4
--- /dev/null
+++ b/mm/pghot-default.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Default mode
+ *
+ * 1 byte hotness record per PFN.
+ * Bucketed time and frequency tracked as part of the record.
+ * Promotion to @pghot_target_nid by default.
+ */
+
+#include
+#include
+
+/* pghot-default doesn't store and hence no NID validation is required */
+bool pghot_nid_valid(int nid)
+{
+	return true;
+}
+
+/*
+ * @time is regular time, @old_time is bucketed time.
+ */
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
+{
+	time &= PGHOT_TIME_BUCKETS_MASK;
+	old_time <<= PGHOT_TIME_BUCKETS_SHIFT;
+
+	return jiffies_to_msecs((time - old_time) & PGHOT_TIME_BUCKETS_MASK);
+}
+
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
+{
+	phi_t freq, old_freq, hotness, old_hotness, old_time;
+	phi_t time = now >> PGHOT_TIME_BUCKETS_SHIFT;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		bool new_window = false;
+
+		hotness = old_hotness;
+		old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+		old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+		if (pghot_access_latency(old_time, now) > sysctl_pghot_freq_window)
+			new_window = true;
+
+		if (new_window)
+			freq = 1;
+		else if (old_freq < PGHOT_FREQ_MAX)
+			freq = old_freq + 1;
+		else
+			freq = old_freq;
+
+		hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+		hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+		hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+		hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+		if (freq >= pghot_freq_threshold)
+			hotness |= BIT(PGHOT_MIGRATE_READY);
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+	return !!(hotness & BIT(PGHOT_MIGRATE_READY));
+}
+
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
+{
+	phi_t old_hotness, hotness = 0;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
+			return -EINVAL;
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+	*nid = pghot_target_nid;
+	*freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+	*time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+	return 0;
+}
diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c
new file mode 100644
index 000000000000..f04e2137309e
--- /dev/null
+++ b/mm/pghot-tunables.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot tunables in debugfs
+ */
+#include
+#include
+#include
+
+static struct dentry *debugfs_pghot;
+static DEFINE_MUTEX(pghot_tunables_lock);
+
+static ssize_t pghot_freq_th_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int freq;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 10, &freq))
+		return -EINVAL;
+
+	if (!freq || freq > PGHOT_FREQ_MAX)
+		return -EINVAL;
+
+	mutex_lock(&pghot_tunables_lock);
+	pghot_freq_threshold = freq;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_freq_th_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", pghot_freq_threshold);
+	return 0;
+}
+
+static int pghot_freq_th_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_freq_th_show, NULL);
+}
+
+static const struct file_operations pghot_freq_th_fops = {
+	.open		= pghot_freq_th_open,
+	.write		= pghot_freq_th_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static ssize_t pghot_target_nid_write(struct file *filp, const char __user *ubuf,
+				      size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int nid;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 10, &nid))
+		return -EINVAL;
+
+	if (nid > PGHOT_NID_MAX || !node_online(nid) || !node_is_toptier(nid))
+		return -EINVAL;
+	mutex_lock(&pghot_tunables_lock);
+	pghot_target_nid = nid;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_target_nid_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", pghot_target_nid);
+	return 0;
+}
+
+static int pghot_target_nid_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_target_nid_show, NULL);
+}
+
+static const struct file_operations pghot_target_nid_fops = {
+	.open		= pghot_target_nid_open,
+	.write		= pghot_target_nid_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static void pghot_src_enabled_update(unsigned int enabled)
+{
+	unsigned int changed = pghot_src_enabled ^ enabled;
+
+	if (changed & PGHOT_HINTFAULTS_ENABLED) {
+		if (enabled & PGHOT_HINTFAULTS_ENABLED)
+			static_branch_enable(&pghot_src_hintfaults);
+		else
+			static_branch_disable(&pghot_src_hintfaults);
+	}
+
+	if (changed & PGHOT_HWHINTS_ENABLED) {
+		if (enabled & PGHOT_HWHINTS_ENABLED)
+			static_branch_enable(&pghot_src_hwhints);
+		else
+			static_branch_disable(&pghot_src_hwhints);
+	}
+}
+
+static ssize_t pghot_src_enabled_write(struct file *filp, const char __user *ubuf,
+				       size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int enabled;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 0, &enabled))
+		return -EINVAL;
+
+	if (enabled & ~PGHOT_SRC_ENABLED_MASK)
+		return -EINVAL;
+
+	mutex_lock(&pghot_tunables_lock);
+	pghot_src_enabled_update(enabled);
+	pghot_src_enabled = enabled;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_src_enabled_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%u\n", pghot_src_enabled);
+	return 0;
+}
+
+static int pghot_src_enabled_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_src_enabled_show, NULL);
+}
+
+static const struct file_operations pghot_src_enabled_fops = {
+	.open		= pghot_src_enabled_open,
+	.write		= pghot_src_enabled_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+void pghot_debug_init(void)
+{
+	debugfs_pghot = debugfs_create_dir("pghot", NULL);
+	debugfs_create_file("enabled_sources", 0644, debugfs_pghot, NULL,
+			    &pghot_src_enabled_fops);
+	debugfs_create_file("target_nid", 0644, debugfs_pghot, NULL,
+			    &pghot_target_nid_fops);
+	debugfs_create_file("freq_threshold", 0644, debugfs_pghot, NULL,
+			    &pghot_freq_th_fops);
+	debugfs_create_u32("kmigrated_sleep_ms", 0644, debugfs_pghot,
+			   &kmigrated_sleep_ms);
+	debugfs_create_u32("kmigrated_batch_nr", 0644, debugfs_pghot,
+			   &kmigrated_batch_nr);
+}
diff --git a/mm/pghot.c b/mm/pghot.c
new file mode 100644
index 000000000000..dac9e6f3b61e
--- /dev/null
+++ b/mm/pghot.c
@@ -0,0 +1,479 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Maintains information about hot pages from slower tier nodes and
+ * promotes them.
+ *
+ * Per-PFN hotness information is stored for lower tier nodes in
+ * mem_section.
+ *
+ * In the default mode, a single byte (u8) is used to store
+ * the frequency of access and last access time. Promotions are done
+ * to a default toptier NID.
+ *
+ * A kernel thread named kmigrated is provided to migrate or promote
+ * the hot pages. kmigrated runs for each lower tier node. It iterates
+ * over the node's PFNs and migrates pages marked for migration into
+ * their targeted nodes.
+ */ +#include +#include +#include +#include +#include + +unsigned int pghot_target_nid =3D PGHOT_DEFAULT_NODE; +unsigned int pghot_src_enabled; +unsigned int pghot_freq_threshold =3D PGHOT_DEFAULT_FREQ_THRESHOLD; +unsigned int kmigrated_sleep_ms =3D KMIGRATED_DEFAULT_SLEEP_MS; +unsigned int kmigrated_batch_nr =3D KMIGRATED_DEFAULT_BATCH_NR; + +unsigned int sysctl_pghot_freq_window =3D PGHOT_DEFAULT_FREQ_WINDOW; + +DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints); +DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults); + +#ifdef CONFIG_SYSCTL +static const struct ctl_table pghot_sysctls[] =3D { + { + .procname =3D "pghot_promote_freq_window_ms", + .data =3D &sysctl_pghot_freq_window, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, +}; +#endif + +static bool kmigrated_started __ro_after_init; + +/** + * pghot_record_access() - Record page accesses from lower tier memory + * for the purpose of tracking page hotness and subsequent promotion. + * + * @pfn: PFN of the page + * @nid: Unused + * @src: The identifier of the sub-system that reports the access + * @now: Access time in jiffies + * + * Updates the frequency and time of access and marks the page as + * ready for migration if the frequency crosses a threshold. The pages + * marked for migration are migrated by kmigrated kernel thread. + * + * Return: 0 on success and -EINVAL on failure to record the access. 
+ */ +int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long= now) +{ + struct mem_section *ms; + struct folio *folio; + phi_t *phi, *hot_map; + struct page *page; + + if (!kmigrated_started) + return 0; + + if (!pghot_nid_valid(nid)) + return -EINVAL; + + switch (src) { + case PGHOT_HINTFAULTS: + if (!static_branch_unlikely(&pghot_src_hintfaults)) + return 0; + count_vm_event(PGHOT_RECORDED_HINTFAULTS); + break; + case PGHOT_HWHINTS: + if (!static_branch_unlikely(&pghot_src_hwhints)) + return 0; + count_vm_event(PGHOT_RECORDED_HWHINTS); + break; + default: + return -EINVAL; + } + + /* + * Record only accesses from lower tiers. + */ + if (node_is_toptier(pfn_to_nid(pfn))) + return 0; + + /* + * Reject the non-migratable pages right away. + */ + page =3D pfn_to_online_page(pfn); + if (!page || is_zone_device_page(page)) + return 0; + + folio =3D page_folio(page); + if (!folio_try_get(folio)) + return 0; + + if (unlikely(page_folio(page) !=3D folio)) + goto out; + + if (!folio_test_lru(folio)) + goto out; + + /* Get the hotness slot corresponding to the 1st PFN of the folio */ + pfn =3D folio_pfn(folio); + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + goto out; + + hot_map =3D (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT= _MASK); + phi =3D &hot_map[pfn % PAGES_PER_SECTION]; + + count_vm_event(PGHOT_RECORDED_ACCESSES); + + /* + * Update the hotness parameters. 
+ */ + if (pghot_update_record(phi, nid, now)) { + set_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map); + set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags); + } +out: + folio_put(folio); + return 0; +} + +static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq, + unsigned long *time) +{ + phi_t *phi, *hot_map; + struct mem_section *ms; + + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + return -EINVAL; + + hot_map =3D (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT= _MASK); + phi =3D &hot_map[pfn % PAGES_PER_SECTION]; + + return pghot_get_record(phi, nid, freq, time); +} + +/* + * Walks the PFNs of the zone, isolates and migrates them in batches. + */ +static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end= _pfn, + int src_nid) +{ + struct mem_cgroup *cur_memcg =3D NULL; + int cur_nid =3D NUMA_NO_NODE; + LIST_HEAD(migrate_list); + int batch_count =3D 0; + struct folio *folio; + struct page *page; + unsigned long pfn; + + pfn =3D start_pfn; + do { + int nid =3D NUMA_NO_NODE, nr =3D 1; + struct mem_cgroup *memcg; + unsigned long time =3D 0; + int freq =3D 0; + + if (!pfn_valid(pfn)) + goto out_next; + + page =3D pfn_to_online_page(pfn); + if (!page) + goto out_next; + + folio =3D page_folio(page); + if (!folio_try_get(folio)) + goto out_next; + + if (unlikely(page_folio(page) !=3D folio)) { + folio_put(folio); + goto out_next; + } + + nr =3D folio_nr_pages(folio); + if (folio_nid(folio) !=3D src_nid) { + folio_put(folio); + goto out_next; + } + + if (!folio_test_lru(folio)) { + folio_put(folio); + goto out_next; + } + + if (pghot_get_hotness(pfn, &nid, &freq, &time)) { + folio_put(folio); + goto out_next; + } + + if (nid =3D=3D NUMA_NO_NODE) + nid =3D pghot_target_nid; + + if (folio_nid(folio) =3D=3D nid) { + folio_put(folio); + goto out_next; + } + + if (migrate_misplaced_folio_prepare(folio, NULL, nid)) { + folio_put(folio); + goto out_next; + } + + memcg =3D folio_memcg(folio); + if 
(cur_nid =3D=3D NUMA_NO_NODE) { + cur_nid =3D nid; + cur_memcg =3D memcg; + } + + /* If NID or memcg changed, flush the previous batch first */ + if (cur_nid !=3D nid || cur_memcg !=3D memcg) { + if (!list_empty(&migrate_list)) + migrate_misplaced_folios_batch(&migrate_list, cur_nid); + cur_nid =3D nid; + cur_memcg =3D memcg; + batch_count =3D 0; + cond_resched(); + } + + list_add(&folio->lru, &migrate_list); + folio_put(folio); + + if (++batch_count > kmigrated_batch_nr) { + migrate_misplaced_folios_batch(&migrate_list, cur_nid); + batch_count =3D 0; + cond_resched(); + } +out_next: + pfn +=3D nr; + } while (pfn < end_pfn); + if (!list_empty(&migrate_list)) + migrate_misplaced_folios_batch(&migrate_list, cur_nid); +} + +static void kmigrated_do_work(pg_data_t *pgdat) +{ + unsigned long section_nr, s_begin, start_pfn; + struct mem_section *ms; + int nid; + + clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags); + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + start_pfn =3D section_nr_to_pfn(section_nr); + ms =3D __nr_to_section(section_nr); + + if (!pfn_valid(start_pfn)) + continue; + + nid =3D pfn_to_nid(start_pfn); + if (node_is_toptier(nid) || nid !=3D pgdat->node_id) + continue; + + if (!test_and_clear_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot= _map)) + continue; + + kmigrated_walk_zone(start_pfn, start_pfn + PAGES_PER_SECTION, + pgdat->node_id); + } +} + +static inline bool kmigrated_work_requested(pg_data_t *pgdat) +{ + return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags); +} + +/* + * Per-node kthread that iterates over its PFNs and migrates the + * pages that have been marked for migration. 
+ */ +static int kmigrated(void *p) +{ + pg_data_t *pgdat =3D p; + + while (!kthread_should_stop()) { + long timeout =3D msecs_to_jiffies(READ_ONCE(kmigrated_sleep_ms)); + + if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(p= gdat), + timeout)) + kmigrated_do_work(pgdat); + } + return 0; +} + +static int kmigrated_run(int nid) +{ + pg_data_t *pgdat =3D NODE_DATA(nid); + int ret; + + if (node_is_toptier(nid)) + return 0; + + if (!pgdat->kmigrated) { + pgdat->kmigrated =3D kthread_create_on_node(kmigrated, pgdat, nid, + "kmigrated%d", nid); + if (IS_ERR(pgdat->kmigrated)) { + ret =3D PTR_ERR(pgdat->kmigrated); + pgdat->kmigrated =3D NULL; + pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret); + return ret; + } + pr_info("pghot: Started kmigrated thread for node %d\n", nid); + } + wake_up_process(pgdat->kmigrated); + return 0; +} + +static void pghot_free_hot_map(struct mem_section *ms) +{ + kfree((void *)((unsigned long)ms->hot_map & ~PGHOT_SECTION_HOT_MASK)); + ms->hot_map =3D NULL; +} + +static int pghot_alloc_hot_map(struct mem_section *ms, int nid) +{ + ms->hot_map =3D kcalloc_node(PAGES_PER_SECTION, PGHOT_RECORD_SIZE, GFP_KE= RNEL, + nid); + if (!ms->hot_map) + return -ENOMEM; + return 0; +} + +static void pghot_offline_sec_hotmap(unsigned long start_pfn, + unsigned long nr_pages) +{ + unsigned long start, end, pfn; + struct mem_section *ms; + + start =3D SECTION_ALIGN_DOWN(start_pfn); + end =3D SECTION_ALIGN_UP(start_pfn + nr_pages); + + for (pfn =3D start; pfn < end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + continue; + + pghot_free_hot_map(ms); + } +} + +static int pghot_online_sec_hotmap(unsigned long start_pfn, + unsigned long nr_pages) +{ + int nid =3D pfn_to_nid(start_pfn); + unsigned long start, end, pfn; + struct mem_section *ms; + int fail =3D 0; + + start =3D SECTION_ALIGN_DOWN(start_pfn); + end =3D SECTION_ALIGN_UP(start_pfn + nr_pages); + + for (pfn =3D start; !fail && pfn 
< end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (!ms || ms->hot_map) + continue; + + fail =3D pghot_alloc_hot_map(ms, nid); + } + + if (!fail) + return 0; + + /* rollback */ + end =3D pfn - PAGES_PER_SECTION; + for (pfn =3D start; pfn < end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (ms && ms->hot_map) + pghot_free_hot_map(ms); + } + return -ENOMEM; +} + +static int pghot_memhp_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + struct memory_notify *mn =3D arg; + int ret =3D 0; + + switch (action) { + case MEM_GOING_ONLINE: + ret =3D pghot_online_sec_hotmap(mn->start_pfn, mn->nr_pages); + break; + case MEM_OFFLINE: + case MEM_CANCEL_ONLINE: + pghot_offline_sec_hotmap(mn->start_pfn, mn->nr_pages); + break; + } + + return notifier_from_errno(ret); +} + +static void pghot_destroy_hot_map(void) +{ + unsigned long section_nr, s_begin; + struct mem_section *ms; + + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + ms =3D __nr_to_section(section_nr); + pghot_free_hot_map(ms); + } +} + +static int pghot_setup_hot_map(void) +{ + unsigned long section_nr, s_begin, start_pfn; + struct mem_section *ms; + int nid; + + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + ms =3D __nr_to_section(section_nr); + start_pfn =3D section_nr_to_pfn(section_nr); + nid =3D pfn_to_nid(start_pfn); + + if (node_is_toptier(nid) || !pfn_valid(start_pfn)) + continue; + + if (pghot_alloc_hot_map(ms, nid)) + goto out_free_hot_map; + } + hotplug_memory_notifier(pghot_memhp_callback, DEFAULT_CALLBACK_PRI); + return 0; + +out_free_hot_map: + pghot_destroy_hot_map(); + return -ENOMEM; +} + +static int __init pghot_init(void) +{ + pg_data_t *pgdat; + int nid, ret; + + ret =3D pghot_setup_hot_map(); + if (ret) + return ret; + + for_each_node_state(nid, N_MEMORY) { + ret =3D kmigrated_run(nid); + if (ret) + goto out_stop_kthread; + } + 
register_sysctl_init("vm", pghot_sysctls); + pghot_debug_init(); + + kmigrated_started =3D true; + return 0; + +out_stop_kthread: + for_each_node_state(nid, N_MEMORY) { + pgdat =3D NODE_DATA(nid); + if (pgdat->kmigrated) { + kthread_stop(pgdat->kmigrated); + pgdat->kmigrated =3D NULL; + } + } + pghot_destroy_hot_map(); + return ret; +} + +late_initcall_sync(pghot_init) diff --git a/mm/vmstat.c b/mm/vmstat.c index 86b14b0f77b5..d3fbe2a5d0e6 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1486,6 +1486,11 @@ const char * const vmstat_text[] =3D { [I(KSTACK_REST)] =3D "kstack_rest", #endif #endif +#ifdef CONFIG_PGHOT + [I(PGHOT_RECORDED_ACCESSES)] =3D "pghot_recorded_accesses", + [I(PGHOT_RECORDED_HINTFAULTS)] =3D "pghot_recorded_hintfaults", + [I(PGHOT_RECORDED_HWHINTS)] =3D "pghot_recorded_hwhints", +#endif /* CONFIG_PGHOT */ #undef I #endif /* CONFIG_VM_EVENT_COUNTERS */ }; --=20 2.34.1
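
Note on the hot_map encoding: the patch keeps a per-section "hot" flag in the
low bits of the ms->hot_map pointer itself (set_bit()/test_and_clear_bit() on
PGHOT_SECTION_HOT_BIT, and masking with ~PGHOT_SECTION_HOT_MASK before every
dereference or kfree()). The userspace sketch below only illustrates that
tagged-pointer idea. It is not the kernel code: the names struct section,
HOT_BIT, HOT_MASK and the section_*() helpers are hypothetical, the flag is
assumed to live in bit 0, and the bit updates here are plain (non-atomic)
operations, whereas the patch uses the kernel's atomic bitops.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed for illustration only; the real bit/mask are defined by the patch. */
#define HOT_BIT		0
#define HOT_MASK	((uintptr_t)1 << HOT_BIT)

typedef uint8_t phi_t;	/* one hotness record per PFN (u8 in the default mode) */

/* Hypothetical stand-in for the hot_map field of struct mem_section. */
struct section {
	uintptr_t hot_map;	/* record array address, with the hot flag in bit 0 */
};

/* Strip the flag bit to recover the real array address. */
static phi_t *section_records(const struct section *s)
{
	return (phi_t *)(s->hot_map & ~HOT_MASK);
}

/* Allocate a zeroed record array; returns 0 on success, -1 on failure. */
static int section_alloc(struct section *s, size_t nr_records)
{
	phi_t *map = calloc(nr_records, sizeof(*map));

	if (!map)
		return -1;
	/* The scheme relies on the allocator leaving the flag bit clear. */
	assert(((uintptr_t)map & HOT_MASK) == 0);
	s->hot_map = (uintptr_t)map;
	return 0;
}

/* Producer side: mark the section as containing at least one hot page. */
static void section_mark_hot(struct section *s)
{
	s->hot_map |= HOT_MASK;
}

/* Consumer side: report whether the flag was set, then clear it. */
static int section_test_and_clear_hot(struct section *s)
{
	int was_hot = (s->hot_map & HOT_MASK) != 0;

	s->hot_map &= ~HOT_MASK;
	return was_hot;
}

/* Free through the untagged address, never through the raw field. */
static void section_free(struct section *s)
{
	free(section_records(s));
	s->hot_map = 0;
}
```

The point of the exercise is that the array address must survive setting and
clearing the flag, which is why every access in the patch goes through the
mask, including the kfree() in pghot_free_hot_map().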