From nobody Thu Apr 2 14:10:30 2026
From: Bharata B Rao
Subject: [RFC PATCH v6 1/5] mm: migrate: Allow misplaced migration without VMA
Date: Mon, 23 Mar 2026 15:21:00 +0530
Message-ID: <20260323095104.238982-2-bharata@amd.com>
In-Reply-To: <20260323095104.238982-1-bharata@amd.com>
References: <20260323095104.238982-1-bharata@amd.com>

We want isolation of misplaced folios to work in contexts where a VMA
isn't available, typically when performing migrations from a kernel
thread context. To prepare for that, allow
migrate_misplaced_folio_prepare() to be called with a NULL VMA.

When migrate_misplaced_folio_prepare() is called with a non-NULL VMA, it
checks whether the folio is mapped shared, which requires holding the
PTL. That path isn't taken when the function is invoked with a NULL VMA
(migration outside of process context). Therefore, when VMA == NULL,
migrate_misplaced_folio_prepare() does not require the caller to hold
the PTL.

Signed-off-by: Bharata B Rao
---
 mm/migrate.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 2c3d489ecf51..a15184950e65 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2652,7 +2652,12 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 
 /*
  * Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
+ *
+ * When called with a NULL vma (e.g., kernel thread initiated migration),
+ * migrate_misplaced_folio_prepare() will allow shared executable folios
+ * to be migrated.
  */
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -2669,7 +2674,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 	 * See folio_maybe_mapped_shared() on possible imprecision
 	 * when we cannot easily detect if a folio is shared.
 	 */
-	if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+	if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
 		return -EACCES;
 
 	/*
-- 
2.34.1

From nobody Thu Apr 2 14:10:30 2026
From: Bharata B Rao
Subject: [RFC PATCH v6 2/5] mm: migrate: Add migrate_misplaced_folios_batch()
Date: Mon, 23 Mar 2026 15:21:01 +0530
Message-ID: <20260323095104.238982-3-bharata@amd.com>
In-Reply-To: <20260323095104.238982-1-bharata@amd.com>
References: <20260323095104.238982-1-bharata@amd.com>
From: Gregory Price

Tiered memory systems often require migrating multiple folios at once.
Currently, migrate_misplaced_folio() handles only one folio per call,
which is inefficient for batch operations.

This patch introduces migrate_misplaced_folios_batch(), a batch variant
that leverages migrate_pages() internally for improved performance.

The caller must isolate folios beforehand using
migrate_misplaced_folio_prepare(). On return, the folio list will be
empty regardless of success or failure.

This function will be used by the pghot kmigrated thread.
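As a rough sketch of the intended call pattern (kernel-context pseudocode,
not part of this patch; `for_each_candidate_folio()` and `target_node` are
hypothetical placeholders, and error handling is elided):

```c
/* Hypothetical caller sketch -- not compilable as-is. */
LIST_HEAD(folio_list);
struct folio *folio;
int nid = target_node;	/* assumed destination toptier node */

/* Isolate each candidate; a NULL vma is allowed per patch 1 of this series. */
for_each_candidate_folio(folio) {	/* hypothetical iterator */
	if (!migrate_misplaced_folio_prepare(folio, NULL, nid))
		list_add_tail(&folio->lru, &folio_list);
}

/*
 * Batch-migrate the isolated folios; the list is emptied and references
 * are dropped regardless of success or failure, so no cleanup is needed.
 */
if (migrate_misplaced_folios_batch(&folio_list, nid))
	pr_debug("some folios could not be migrated (-EAGAIN)\n");
```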
Signed-off-by: Gregory Price
[Rewrote commit description]
Signed-off-by: Bharata B Rao
---
 include/linux/migrate.h |  6 ++++++
 mm/migrate.c            | 48 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d5af2b7f577b..5c1e2691cec2 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -111,6 +111,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -121,6 +122,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folios_batch(struct list_head *folio_list,
+						 int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index a15184950e65..94daec0f49ef 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2751,5 +2751,53 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/**
+ * migrate_misplaced_folios_batch() - Batch variant of migrate_misplaced_folio
+ * Attempts to migrate a folio list to the specified destination.
+ * @folio_list: Isolated list of folios to be batch-migrated.
+ * @node: The NUMA node ID to where the folios should be migrated.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio. All the isolated folios
+ * in the list must belong to the same memcg so that NUMA_PAGE_MIGRATE
+ * stat can be attributed correctly to the memcg.
+ *
+ * This function will un-isolate the folios, drop the elevated reference
+ * and remove them from the list before returning. This is called
+ * only for batched promotion of hot pages from lower tier nodes.
+ *
+ * Return: 0 on success and -EAGAIN on failure or partial migration.
+ *         On return, @folio_list will be empty regardless of success/failure.
+ */
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_succeeded = 0;
+	int nr_remaining;
+
+	if (!list_empty(folio_list)) {
+		struct folio *first = list_first_entry(folio_list, struct folio, lru);
+
+		memcg = get_mem_cgroup_from_folio(first);
+	}
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+	}
+
+	mem_cgroup_put(memcg);
+	WARN_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.34.1

From nobody Thu Apr 2 14:10:30 2026
From: Bharata B Rao
Subject: [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot
Date: Mon, 23 Mar 2026 15:21:02 +0530
Message-ID: <20260323095104.238982-4-bharata@amd.com>
In-Reply-To: <20260323095104.238982-1-bharata@amd.com>
References: <20260323095104.238982-1-bharata@amd.com>

pghot is a subsystem that collects memory access information from
multiple sources, classifies hot pages resident in lower-tier memory,
and promotes them to faster tiers. It stores per-PFN hotness metadata
and performs asynchronous, batched promotion via a per-lower-tier-node
kernel thread (kmigrated).

This change introduces the default (compact) mode of pghot:

- Per-PFN hotness record (phi_t = u8) embedded via mem_section:
  - 2 bits: access frequency (4 levels)
  - 5 bits: time bucket (≈4s window with HZ=1000, bucketed jiffies)
  - 1 bit : migration-ready flag (MSB)
  The LSB of the mem_section->hot_map pointer is used as a per-section
  "hot" flag to gate scanning.

- Event recording API:
  int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
  @pfn: The PFN of the memory accessed
  @nid: The accessing NUMA node ID
  @src: The temperature source (subsystem) that generated the access info
  @now: The access time in jiffies
  - Sources (e.g., NUMA hint faults, HW hints) call this to report
    accesses.
  - In default mode, the nid is not stored/used for targeting; promotion
    goes to a configurable toptier node (pghot_target_nid).

- Promotion engine:
  - One kmigrated thread per lower-tier node.
  - Scans only sections whose "hot" flag was raised, iterates PFNs, and
    batches candidates by destination node.
  - Uses migrate_misplaced_folios_batch() to move batched folios.
- Tunables & stats:
  - debugfs: enabled_sources, target_nid, freq_threshold,
             kmigrated_sleep_ms, kmigrated_batch_nr
  - sysctl : vm.pghot_promote_freq_window_ms
  - vmstat : pghot_recorded_accesses, pghot_recorded_hintfaults,
             pghot_recorded_hwhints

Memory overhead
---------------
Default mode uses 1 byte of hotness metadata per PFN on lower-tier nodes.

Behavior & policy
-----------------
- Default mode promotion target: The nid passed by sources is not stored;
  hot pages are promoted to pghot_target_nid (toptier). Precision mode
  (added later in the series) changes this.
- Record consumption: kmigrated consumes (clears) the "migration-ready"
  bit before attempting isolation. If isolation/migration fails, the
  folio is not re-queued automatically; subsequent accesses will re-arm
  it. This avoids retry storms and keeps batching stable.
- Wakeups: kmigrated wakeups are intentionally timeout-driven in v6. We
  set the per-pgdat "activate" flag on access, and kmigrated checks this
  flag on its next sleep interval. This keeps the first cut simple and
  avoids potential wake storms; active wakeups can be considered in a
  follow-up.
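The windowed counting and re-arm behavior described above can be sketched in userspace C. The constants and the struct are illustrative stand-ins (the kernel packs this state into one byte and updates it with try_cmpxchg); the sketch only shows the policy:

```c
#include <assert.h>

/* Model of the windowed frequency policy: an access outside the
 * promote-frequency window restarts the count at 1; within the window
 * it increments up to FREQ_MAX; the record becomes migration-ready
 * once the count reaches the threshold. The consumer (kmigrated in
 * the patch) clears `ready` before attempting migration. */
#define FREQ_MAX        3
#define FREQ_THRESHOLD  2
#define WINDOW_MS       3000

struct rec { unsigned freq; long last_ms; int ready; };

static void record_access(struct rec *r, long now_ms)
{
    if (now_ms - r->last_ms > WINDOW_MS)
        r->freq = 1;                 /* new window: restart the count */
    else if (r->freq < FREQ_MAX)
        r->freq++;
    r->last_ms = now_ms;
    if (r->freq >= FREQ_THRESHOLD)
        r->ready = 1;                /* sticky until consumed */
}
```

Two accesses inside one window arm the record; a long gap resets the count, which is the "re-arm on subsequent accesses" behavior after a failed migration.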
Signed-off-by: Bharata B Rao
---
 Documentation/admin-guide/mm/pghot.txt |  80 +++++
 include/linux/migrate.h                |   4 +-
 include/linux/mmzone.h                 |  20 ++
 include/linux/pghot.h                  |  82 +++++
 include/linux/vm_event_item.h          |   5 +
 mm/Kconfig                             |  14 +
 mm/Makefile                            |   1 +
 mm/migrate.c                           |  19 +-
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  79 ++++
 mm/pghot-tunables.c                    | 182 ++++++++++
 mm/pghot.c                             | 479 +++++++++++++++++++++++++
 mm/vmstat.c                            |   5 +
 13 files changed, 971 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.txt
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

diff --git a/Documentation/admin-guide/mm/pghot.txt b/Documentation/admin-guide/mm/pghot.txt
new file mode 100644
index 000000000000..5f51dd1d4d45
--- /dev/null
+++ b/Documentation/admin-guide/mm/pghot.txt
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+PGHOT: Hot Page Tracking Tunables
+=================================
+
+Overview
+========
+The PGHOT subsystem tracks frequently accessed pages in lower-tier memory and
+promotes them to faster tiers. It uses per-PFN hotness metadata and asynchronous
+migration via per-node kernel threads (kmigrated).
+
+This document describes the tunables available via **debugfs** and **sysctl**
+for PGHOT.
+
+Debugfs Interface
+=================
+Path: /sys/kernel/debug/pghot/
+
+1. **enabled_sources**
+   - Bitmask to enable/disable hotness sources.
+   - Bits:
+     - 0: Hint faults (value 0x1)
+     - 1: Hardware hints (value 0x2)
+   - Default: 0 (disabled)
+   - Example:
+     # echo 0x3 > /sys/kernel/debug/pghot/enabled_sources
+     Enables all sources.
+
+2. **target_nid**
+   - Toptier NUMA node ID to which hot pages should be promoted when the
+     source does not provide a nid, i.e., when the hotness source can't
+     report the accessing NID or when the tracking mode is default.
+   - Default: 0
+   - Example:
+     # echo 1 > /sys/kernel/debug/pghot/target_nid
+
+3. **freq_threshold**
+   - Minimum access frequency before a page is marked ready for promotion.
+   - Range: 1 to 3
+   - Default: 2
+   - Example:
+     # echo 3 > /sys/kernel/debug/pghot/freq_threshold
+
+4. **kmigrated_sleep_ms**
+   - Sleep interval (ms) for the kmigrated thread between scans.
+   - Default: 100
+
+5. **kmigrated_batch_nr**
+   - Maximum number of folios migrated in one batch.
+   - Default: 512
+
+Sysctl Interface
+================
+1. pghot_promote_freq_window_ms
+
+Path: /proc/sys/vm/pghot_promote_freq_window_ms
+
+- Controls the time window (in ms) for counting access frequency. A page is
+  considered hot only when **freq_threshold** accesses occur within this
+  time period.
+- Default: 3000 (3 seconds)
+- Example:
+  # sysctl vm.pghot_promote_freq_window_ms=3000
+
+Vmstat Counters
+===============
+The following vmstat counters provide statistics about the pghot subsystem.
+
+Path: /proc/vmstat
+
+1. **pghot_recorded_accesses**
+   - Total number of hot page accesses recorded by pghot.
+
+2. **pghot_recorded_hintfaults**
+   - Number of recorded accesses reported by the NUMA-balancing-based
+     hotness source.
+
+3. **pghot_recorded_hwhints**
+   - Number of recorded accesses reported by the hwhints source.
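The freq window documented above is evaluated against a coarse bucketed timestamp rather than raw jiffies: later in this patch, pghot stores only 5 bits of (jiffies >> 7), so with HZ=1000 one bucket is 128 ms and the field wraps after roughly 4.1 s. A simplified userspace sketch of that modular arithmetic (assuming HZ=1000 so jiffies map 1:1 to milliseconds; this is not the kernel's exact helper):

```c
#include <assert.h>

/* Userspace model of the bucketed-time arithmetic: the stored time
 * is (jiffies >> 7) truncated to 5 bits; latency between two
 * accesses is recovered by modular subtraction, which is correct as
 * long as fewer than 32 buckets (~4.1 s at HZ=1000) have elapsed. */
#define TIME_BUCKETS_SHIFT 7
#define TIME_WIDTH 5
#define TIME_MASK ((1u << TIME_WIDTH) - 1)

static unsigned bucket(unsigned long jiffies)      /* store side */
{
    return (jiffies >> TIME_BUCKETS_SHIFT) & TIME_MASK;
}

static unsigned long latency_ms(unsigned old_bucket, unsigned long now_jiffies)
{
    unsigned now_bucket = bucket(now_jiffies);

    /* modular subtraction handles the 5-bit wraparound */
    return (unsigned long)((now_bucket - old_bucket) & TIME_MASK)
               << TIME_BUCKETS_SHIFT;
}
```

The 128 ms bucket granularity is why the sysctl window is specified in whole seconds: finer windows than one bucket cannot be distinguished.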
diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 5c1e2691cec2..7f912b6ebf02 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -107,7 +107,7 @@ static inline void softleaf_entry_wait_on_locked(softle= af_t entry, spinlock_t *p =20 #endif /* CONFIG_MIGRATION */ =20 -#ifdef CONFIG_NUMA_BALANCING +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT) int migrate_misplaced_folio_prepare(struct folio *folio, struct vm_area_struct *vma, int node); int migrate_misplaced_folio(struct folio *folio, int node); @@ -127,7 +127,7 @@ static inline int migrate_misplaced_folios_batch(struct= list_head *folio_list, { return -EAGAIN; /* can't migrate now */ } -#endif /* CONFIG_NUMA_BALANCING */ +#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */ =20 #ifdef CONFIG_MIGRATION =20 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3e51190a55e4..d7ed60956543 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1064,6 +1064,7 @@ enum pgdat_flags { * many pages under writeback */ PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */ + PGDAT_KMIGRATED_ACTIVATE, /* activates kmigrated */ }; =20 enum zone_flags { @@ -1518,6 +1519,10 @@ typedef struct pglist_data { #ifdef CONFIG_MEMORY_FAILURE struct memory_failure_stats mf_stats; #endif +#ifdef CONFIG_PGHOT + struct task_struct *kmigrated; + wait_queue_head_t kmigrated_wait; +#endif } pg_data_t; =20 #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) @@ -1930,12 +1935,27 @@ struct mem_section { unsigned long section_mem_map; =20 struct mem_section_usage *usage; +#ifdef CONFIG_PGHOT + /* + * Per-PFN hotness data for this section. + * Array of phi_t (u8 in default mode). + * LSB is used as PGHOT_SECTION_HOT_BIT flag. + */ + void *hot_map; +#endif #ifdef CONFIG_PAGE_EXTENSION /* * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use * section. (see page_ext.h about this.) 
 	 */
 	struct page_ext *page_ext;
+#endif
+	/*
+	 * Padding to maintain consistent mem_section size when exactly
+	 * one of PGHOT or PAGE_EXTENSION is enabled. This ensures
+	 * optimal alignment regardless of configuration.
+	 */
+#if (defined(CONFIG_PGHOT) ^ defined(CONFIG_PAGE_EXTENSION))
 	unsigned long pad;
 #endif
 	/*
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..525d4dd28fc1
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGHOT_H
+#define _LINUX_PGHOT_H
+
+/* Page hotness temperature sources */
+enum pghot_src {
+	PGHOT_HINTFAULTS = 0,
+	PGHOT_HWHINTS,
+	PGHOT_SRC_MAX
+};
+
+#ifdef CONFIG_PGHOT
+#include
+
+extern unsigned int pghot_target_nid;
+extern unsigned int pghot_src_enabled;
+extern unsigned int pghot_freq_threshold;
+extern unsigned int kmigrated_sleep_ms;
+extern unsigned int kmigrated_batch_nr;
+extern unsigned int sysctl_pghot_freq_window;
+
+void pghot_debug_init(void);
+
+DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
+
+#define PGHOT_HINTFAULTS_ENABLED	BIT(PGHOT_HINTFAULTS)
+#define PGHOT_HWHINTS_ENABLED	BIT(PGHOT_HWHINTS)
+#define PGHOT_SRC_ENABLED_MASK	GENMASK(PGHOT_SRC_MAX - 1, 0)
+
+#define PGHOT_DEFAULT_FREQ_THRESHOLD	2
+
+#define KMIGRATED_DEFAULT_SLEEP_MS	100
+#define KMIGRATED_DEFAULT_BATCH_NR	512
+
+#define PGHOT_DEFAULT_NODE	0
+
+#define PGHOT_DEFAULT_FREQ_WINDOW	(3 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-6 are used to store frequency and time.
+ * Bit 7 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY	7
+
+#define PGHOT_FREQ_WIDTH	2
+/* Bucketed time is stored in 5 bits which can represent up to 3.9s with HZ=1000 */
+#define PGHOT_TIME_BUCKETS_SHIFT	7
+#define PGHOT_TIME_WIDTH	5
+#define PGHOT_NID_WIDTH	10
+
+#define PGHOT_FREQ_SHIFT	0
+#define PGHOT_TIME_SHIFT	(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_FREQ_MASK	GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK	GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+#define PGHOT_TIME_BUCKETS_MASK	(PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_SHIFT)
+
+#define PGHOT_NID_MAX	((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX	((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX	((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u8 phi_t;
+
+#define PGHOT_RECORD_SIZE	sizeof(phi_t)
+
+#define PGHOT_SECTION_HOT_BIT	0
+#define PGHOT_SECTION_HOT_MASK	BIT(PGHOT_SECTION_HOT_BIT)
+
+bool pghot_nid_valid(int nid);
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time);
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now);
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time);
+
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+	return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_PGHOT_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 22a139f82d75..4ce670c1bb02 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -188,6 +188,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 	KSTACK_REST,
 #endif
 #endif /* CONFIG_DEBUG_STACK_USAGE */
+#ifdef CONFIG_PGHOT
+	PGHOT_RECORDED_ACCESSES,
+	PGHOT_RECORDED_HINTFAULTS,
+	PGHOT_RECORDED_HWHINTS,
+#endif /* CONFIG_PGHOT */
 	NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index ebd8ea353687..4aeab6aee535 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1471,6 +1471,20 @@ config
LAZY_MMU_MODE_KUNIT_TEST =20 If unsure, say N. =20 +config PGHOT + bool "Hot page tracking and promotion" + def_bool n + depends on NUMA && MIGRATION && SPARSEMEM && MMU + help + A sub-system to track page accesses in lower tier memory and + maintain hot page information. Promotes hot pages from lower + tiers to top tier by using the memory access information provided + by various sources. Asynchronous promotion is done by per-node + kernel threads. + + This adds 1 byte of metadata overhead per page in lower-tier + memory nodes. + source "mm/damon/Kconfig" =20 endmenu diff --git a/mm/Makefile b/mm/Makefile index 8ad2ab08244e..33014de43acc 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -150,3 +150,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) +=3D shrinker_debug.o obj-$(CONFIG_EXECMEM) +=3D execmem.o obj-$(CONFIG_TMPFS_QUOTA) +=3D shmem_quota.o obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) +=3D tests/lazy_mmu_mode_kunit.o +obj-$(CONFIG_PGHOT) +=3D pghot.o pghot-tunables.o pghot-default.o diff --git a/mm/migrate.c b/mm/migrate.c index 94daec0f49ef..a5f48984ed3e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2606,7 +2606,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long= , nr_pages, return kernel_move_pages(pid, nr_pages, pages, nodes, status, flags); } =20 -#ifdef CONFIG_NUMA_BALANCING +#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT) /* * Returns true if this is a safe migration target node for misplaced NUMA * pages. Currently it only checks the watermarks which is crude. 
@@ -2726,12 +2726,10 @@ int migrate_misplaced_folio_prepare(struct folio *f= olio, */ int migrate_misplaced_folio(struct folio *folio, int node) { - pg_data_t *pgdat =3D NODE_DATA(node); int nr_remaining; unsigned int nr_succeeded; LIST_HEAD(migratepages); struct mem_cgroup *memcg =3D get_mem_cgroup_from_folio(folio); - struct lruvec *lruvec =3D mem_cgroup_lruvec(memcg, pgdat); =20 list_add(&folio->lru, &migratepages); nr_remaining =3D migrate_pages(&migratepages, alloc_misplaced_dst_folio, @@ -2740,12 +2738,18 @@ int migrate_misplaced_folio(struct folio *folio, in= t node) if (nr_remaining && !list_empty(&migratepages)) putback_movable_pages(&migratepages); if (nr_succeeded) { +#ifdef CONFIG_NUMA_BALANCING count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && !node_is_toptier(folio_nid(folio)) - && node_is_toptier(node)) + && node_is_toptier(node)) { + pg_data_t *pgdat =3D NODE_DATA(node); + struct lruvec *lruvec =3D mem_cgroup_lruvec(memcg, pgdat); + mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded); + } +#endif } mem_cgroup_put(memcg); BUG_ON(!list_empty(&migratepages)); @@ -2773,7 +2777,6 @@ int migrate_misplaced_folio(struct folio *folio, int = node) */ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node) { - pg_data_t *pgdat =3D NODE_DATA(node); struct mem_cgroup *memcg =3D NULL; unsigned int nr_succeeded =3D 0; int nr_remaining; @@ -2790,14 +2793,16 @@ int migrate_misplaced_folios_batch(struct list_head= *folio_list, int node) putback_movable_pages(folio_list); =20 if (nr_succeeded) { +#ifdef CONFIG_NUMA_BALANCING count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); - mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded); count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); + mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded); +#endif } =20 mem_cgroup_put(memcg); 
WARN_ON(!list_empty(folio_list)); return nr_remaining ? -EAGAIN : 0; } -#endif /* CONFIG_NUMA_BALANCING */ +#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */ #endif /* CONFIG_NUMA */ diff --git a/mm/mm_init.c b/mm/mm_init.c index df34797691bd..c777c54cfe69 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1398,6 +1398,15 @@ static void pgdat_init_kcompactd(struct pglist_data = *pgdat) static void pgdat_init_kcompactd(struct pglist_data *pgdat) {} #endif =20 +#ifdef CONFIG_PGHOT +static void pgdat_init_kmigrated(struct pglist_data *pgdat) +{ + init_waitqueue_head(&pgdat->kmigrated_wait); +} +#else +static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {} +#endif + static void __meminit pgdat_init_internals(struct pglist_data *pgdat) { int i; @@ -1407,6 +1416,7 @@ static void __meminit pgdat_init_internals(struct pgl= ist_data *pgdat) =20 pgdat_init_split_queue(pgdat); pgdat_init_kcompactd(pgdat); + pgdat_init_kmigrated(pgdat); =20 init_waitqueue_head(&pgdat->kswapd_wait); init_waitqueue_head(&pgdat->pfmemalloc_wait); diff --git a/mm/pghot-default.c b/mm/pghot-default.c new file mode 100644 index 000000000000..e610062345e4 --- /dev/null +++ b/mm/pghot-default.c @@ -0,0 +1,79 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * pghot: Default mode + * + * 1 byte hotness record per PFN. + * Bucketed time and frequency tracked as part of the record. + * Promotion to @pghot_target_nid by default. + */ + +#include +#include + +/* pghot-default doesn't store and hence no NID validation is required */ +bool pghot_nid_valid(int nid) +{ + return true; +} + +/* + * @time is regular time, @old_time is bucketed time. 
+ */ +unsigned long pghot_access_latency(unsigned long old_time, unsigned long t= ime) +{ + time &=3D PGHOT_TIME_BUCKETS_MASK; + old_time <<=3D PGHOT_TIME_BUCKETS_SHIFT; + + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_BUCKETS_MASK); +} + +bool pghot_update_record(phi_t *phi, int nid, unsigned long now) +{ + phi_t freq, old_freq, hotness, old_hotness, old_time; + phi_t time =3D now >> PGHOT_TIME_BUCKETS_SHIFT; + + old_hotness =3D READ_ONCE(*phi); + do { + bool new_window =3D false; + + hotness =3D old_hotness; + old_freq =3D (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + old_time =3D (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + + if (pghot_access_latency(old_time, now) > sysctl_pghot_freq_window) + new_window =3D true; + + if (new_window) + freq =3D 1; + else if (old_freq < PGHOT_FREQ_MAX) + freq =3D old_freq + 1; + else + freq =3D old_freq; + + hotness &=3D ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT); + hotness &=3D ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT); + + hotness |=3D (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT; + hotness |=3D (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT; + + if (freq >=3D pghot_freq_threshold) + hotness |=3D BIT(PGHOT_MIGRATE_READY); + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + return !!(hotness & BIT(PGHOT_MIGRATE_READY)); +} + +int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time) +{ + phi_t old_hotness, hotness =3D 0; + + old_hotness =3D READ_ONCE(*phi); + do { + if (!(old_hotness & BIT(PGHOT_MIGRATE_READY))) + return -EINVAL; + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + + *nid =3D pghot_target_nid; + *freq =3D (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + *time =3D (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + return 0; +} diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c new file mode 100644 index 000000000000..f04e2137309e --- /dev/null +++ b/mm/pghot-tunables.c @@ -0,0 +1,182 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * pghot 
tunables in debugfs + */ +#include +#include +#include + +static struct dentry *debugfs_pghot; +static DEFINE_MUTEX(pghot_tunables_lock); + +static ssize_t pghot_freq_th_write(struct file *filp, const char __user *u= buf, + size_t cnt, loff_t *ppos) +{ + char buf[16]; + unsigned int freq; + + if (cnt > 15) + cnt =3D 15; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + buf[cnt] =3D '\0'; + + if (kstrtouint(buf, 10, &freq)) + return -EINVAL; + + if (!freq || freq > PGHOT_FREQ_MAX) + return -EINVAL; + + mutex_lock(&pghot_tunables_lock); + pghot_freq_threshold =3D freq; + mutex_unlock(&pghot_tunables_lock); + + *ppos +=3D cnt; + return cnt; +} + +static int pghot_freq_th_show(struct seq_file *m, void *v) +{ + seq_printf(m, "%d\n", pghot_freq_threshold); + return 0; +} + +static int pghot_freq_th_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, pghot_freq_th_show, NULL); +} + +static const struct file_operations pghot_freq_th_fops =3D { + .open =3D pghot_freq_th_open, + .write =3D pghot_freq_th_write, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D seq_release, +}; + +static ssize_t pghot_target_nid_write(struct file *filp, const char __user= *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[16]; + unsigned int nid; + + if (cnt > 15) + cnt =3D 15; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + buf[cnt] =3D '\0'; + + if (kstrtouint(buf, 10, &nid)) + return -EINVAL; + + if (nid > PGHOT_NID_MAX || !node_online(nid) || !node_is_toptier(nid)) + return -EINVAL; + mutex_lock(&pghot_tunables_lock); + pghot_target_nid =3D nid; + mutex_unlock(&pghot_tunables_lock); + + *ppos +=3D cnt; + return cnt; +} + +static int pghot_target_nid_show(struct seq_file *m, void *v) +{ + seq_printf(m, "%d\n", pghot_target_nid); + return 0; +} + +static int pghot_target_nid_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, pghot_target_nid_show, NULL); +} + +static const struct file_operations 
pghot_target_nid_fops =3D { + .open =3D pghot_target_nid_open, + .write =3D pghot_target_nid_write, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D seq_release, +}; + +static void pghot_src_enabled_update(unsigned int enabled) +{ + unsigned int changed =3D pghot_src_enabled ^ enabled; + + if (changed & PGHOT_HINTFAULTS_ENABLED) { + if (enabled & PGHOT_HINTFAULTS_ENABLED) + static_branch_enable(&pghot_src_hintfaults); + else + static_branch_disable(&pghot_src_hintfaults); + } + + if (changed & PGHOT_HWHINTS_ENABLED) { + if (enabled & PGHOT_HWHINTS_ENABLED) + static_branch_enable(&pghot_src_hwhints); + else + static_branch_disable(&pghot_src_hwhints); + } +} + +static ssize_t pghot_src_enabled_write(struct file *filp, const char __use= r *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[16]; + unsigned int enabled; + + if (cnt > 15) + cnt =3D 15; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + buf[cnt] =3D '\0'; + + if (kstrtouint(buf, 0, &enabled)) + return -EINVAL; + + if (enabled & ~PGHOT_SRC_ENABLED_MASK) + return -EINVAL; + + mutex_lock(&pghot_tunables_lock); + pghot_src_enabled_update(enabled); + pghot_src_enabled =3D enabled; + mutex_unlock(&pghot_tunables_lock); + + *ppos +=3D cnt; + return cnt; +} + +static int pghot_src_enabled_show(struct seq_file *m, void *v) +{ + seq_printf(m, "%u\n", pghot_src_enabled); + return 0; +} + +static int pghot_src_enabled_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, pghot_src_enabled_show, NULL); +} + +static const struct file_operations pghot_src_enabled_fops =3D { + .open =3D pghot_src_enabled_open, + .write =3D pghot_src_enabled_write, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D seq_release, +}; + +void pghot_debug_init(void) +{ + debugfs_pghot =3D debugfs_create_dir("pghot", NULL); + debugfs_create_file("enabled_sources", 0644, debugfs_pghot, NULL, + &pghot_src_enabled_fops); + debugfs_create_file("target_nid", 0644, debugfs_pghot, NULL, + 
&pghot_target_nid_fops); + debugfs_create_file("freq_threshold", 0644, debugfs_pghot, NULL, + &pghot_freq_th_fops); + debugfs_create_u32("kmigrated_sleep_ms", 0644, debugfs_pghot, + &kmigrated_sleep_ms); + debugfs_create_u32("kmigrated_batch_nr", 0644, debugfs_pghot, + &kmigrated_batch_nr); +} diff --git a/mm/pghot.c b/mm/pghot.c new file mode 100644 index 000000000000..dac9e6f3b61e --- /dev/null +++ b/mm/pghot.c @@ -0,0 +1,479 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Maintains information about hot pages from slower tier nodes and + * promotes them. + * + * Per-PFN hotness information is stored for lower tier nodes in + * mem_section. + * + * In the default mode, a single byte (u8) is used to store + * the frequency of access and last access time. Promotions are done + * to a default toptier NID. + * + * A kernel thread named kmigrated is provided to migrate or promote + * the hot pages. kmigrated runs for each lower tier node. It iterates + * over the node's PFNs and migrates pages marked for migration into + * their targeted nodes. 
+ */ +#include +#include +#include +#include +#include + +unsigned int pghot_target_nid =3D PGHOT_DEFAULT_NODE; +unsigned int pghot_src_enabled; +unsigned int pghot_freq_threshold =3D PGHOT_DEFAULT_FREQ_THRESHOLD; +unsigned int kmigrated_sleep_ms =3D KMIGRATED_DEFAULT_SLEEP_MS; +unsigned int kmigrated_batch_nr =3D KMIGRATED_DEFAULT_BATCH_NR; + +unsigned int sysctl_pghot_freq_window =3D PGHOT_DEFAULT_FREQ_WINDOW; + +DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints); +DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults); + +#ifdef CONFIG_SYSCTL +static const struct ctl_table pghot_sysctls[] =3D { + { + .procname =3D "pghot_promote_freq_window_ms", + .data =3D &sysctl_pghot_freq_window, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, +}; +#endif + +static bool kmigrated_started __ro_after_init; + +/** + * pghot_record_access() - Record page accesses from lower tier memory + * for the purpose of tracking page hotness and subsequent promotion. + * + * @pfn: PFN of the page + * @nid: Unused + * @src: The identifier of the sub-system that reports the access + * @now: Access time in jiffies + * + * Updates the frequency and time of access and marks the page as + * ready for migration if the frequency crosses a threshold. The pages + * marked for migration are migrated by kmigrated kernel thread. + * + * Return: 0 on success and -EINVAL on failure to record the access. 
+ */ +int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long= now) +{ + struct mem_section *ms; + struct folio *folio; + phi_t *phi, *hot_map; + struct page *page; + + if (!kmigrated_started) + return 0; + + if (!pghot_nid_valid(nid)) + return -EINVAL; + + switch (src) { + case PGHOT_HINTFAULTS: + if (!static_branch_unlikely(&pghot_src_hintfaults)) + return 0; + count_vm_event(PGHOT_RECORDED_HINTFAULTS); + break; + case PGHOT_HWHINTS: + if (!static_branch_unlikely(&pghot_src_hwhints)) + return 0; + count_vm_event(PGHOT_RECORDED_HWHINTS); + break; + default: + return -EINVAL; + } + + /* + * Record only accesses from lower tiers. + */ + if (node_is_toptier(pfn_to_nid(pfn))) + return 0; + + /* + * Reject the non-migratable pages right away. + */ + page =3D pfn_to_online_page(pfn); + if (!page || is_zone_device_page(page)) + return 0; + + folio =3D page_folio(page); + if (!folio_try_get(folio)) + return 0; + + if (unlikely(page_folio(page) !=3D folio)) + goto out; + + if (!folio_test_lru(folio)) + goto out; + + /* Get the hotness slot corresponding to the 1st PFN of the folio */ + pfn =3D folio_pfn(folio); + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + goto out; + + hot_map =3D (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT= _MASK); + phi =3D &hot_map[pfn % PAGES_PER_SECTION]; + + count_vm_event(PGHOT_RECORDED_ACCESSES); + + /* + * Update the hotness parameters. 
+ */ + if (pghot_update_record(phi, nid, now)) { + set_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map); + set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags); + } +out: + folio_put(folio); + return 0; +} + +static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq, + unsigned long *time) +{ + phi_t *phi, *hot_map; + struct mem_section *ms; + + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + return -EINVAL; + + hot_map =3D (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT= _MASK); + phi =3D &hot_map[pfn % PAGES_PER_SECTION]; + + return pghot_get_record(phi, nid, freq, time); +} + +/* + * Walks the PFNs of the zone, isolates and migrates them in batches. + */ +static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end= _pfn, + int src_nid) +{ + struct mem_cgroup *cur_memcg =3D NULL; + int cur_nid =3D NUMA_NO_NODE; + LIST_HEAD(migrate_list); + int batch_count =3D 0; + struct folio *folio; + struct page *page; + unsigned long pfn; + + pfn =3D start_pfn; + do { + int nid =3D NUMA_NO_NODE, nr =3D 1; + struct mem_cgroup *memcg; + unsigned long time =3D 0; + int freq =3D 0; + + if (!pfn_valid(pfn)) + goto out_next; + + page =3D pfn_to_online_page(pfn); + if (!page) + goto out_next; + + folio =3D page_folio(page); + if (!folio_try_get(folio)) + goto out_next; + + if (unlikely(page_folio(page) !=3D folio)) { + folio_put(folio); + goto out_next; + } + + nr =3D folio_nr_pages(folio); + if (folio_nid(folio) !=3D src_nid) { + folio_put(folio); + goto out_next; + } + + if (!folio_test_lru(folio)) { + folio_put(folio); + goto out_next; + } + + if (pghot_get_hotness(pfn, &nid, &freq, &time)) { + folio_put(folio); + goto out_next; + } + + if (nid =3D=3D NUMA_NO_NODE) + nid =3D pghot_target_nid; + + if (folio_nid(folio) =3D=3D nid) { + folio_put(folio); + goto out_next; + } + + if (migrate_misplaced_folio_prepare(folio, NULL, nid)) { + folio_put(folio); + goto out_next; + } + + memcg =3D folio_memcg(folio); + if 
(cur_nid =3D=3D NUMA_NO_NODE) { + cur_nid =3D nid; + cur_memcg =3D memcg; + } + + /* If NID or memcg changed, flush the previous batch first */ + if (cur_nid !=3D nid || cur_memcg !=3D memcg) { + if (!list_empty(&migrate_list)) + migrate_misplaced_folios_batch(&migrate_list, cur_nid); + cur_nid =3D nid; + cur_memcg =3D memcg; + batch_count =3D 0; + cond_resched(); + } + + list_add(&folio->lru, &migrate_list); + folio_put(folio); + + if (++batch_count > kmigrated_batch_nr) { + migrate_misplaced_folios_batch(&migrate_list, cur_nid); + batch_count =3D 0; + cond_resched(); + } +out_next: + pfn +=3D nr; + } while (pfn < end_pfn); + if (!list_empty(&migrate_list)) + migrate_misplaced_folios_batch(&migrate_list, cur_nid); +} + +static void kmigrated_do_work(pg_data_t *pgdat) +{ + unsigned long section_nr, s_begin, start_pfn; + struct mem_section *ms; + int nid; + + clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags); + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + start_pfn =3D section_nr_to_pfn(section_nr); + ms =3D __nr_to_section(section_nr); + + if (!pfn_valid(start_pfn)) + continue; + + nid =3D pfn_to_nid(start_pfn); + if (node_is_toptier(nid) || nid !=3D pgdat->node_id) + continue; + + if (!test_and_clear_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot= _map)) + continue; + + kmigrated_walk_zone(start_pfn, start_pfn + PAGES_PER_SECTION, + pgdat->node_id); + } +} + +static inline bool kmigrated_work_requested(pg_data_t *pgdat) +{ + return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags); +} + +/* + * Per-node kthread that iterates over its PFNs and migrates the + * pages that have been marked for migration. 
+ */ +static int kmigrated(void *p) +{ + pg_data_t *pgdat =3D p; + + while (!kthread_should_stop()) { + long timeout =3D msecs_to_jiffies(READ_ONCE(kmigrated_sleep_ms)); + + if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(p= gdat), + timeout)) + kmigrated_do_work(pgdat); + } + return 0; +} + +static int kmigrated_run(int nid) +{ + pg_data_t *pgdat =3D NODE_DATA(nid); + int ret; + + if (node_is_toptier(nid)) + return 0; + + if (!pgdat->kmigrated) { + pgdat->kmigrated =3D kthread_create_on_node(kmigrated, pgdat, nid, + "kmigrated%d", nid); + if (IS_ERR(pgdat->kmigrated)) { + ret =3D PTR_ERR(pgdat->kmigrated); + pgdat->kmigrated =3D NULL; + pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret); + return ret; + } + pr_info("pghot: Started kmigrated thread for node %d\n", nid); + } + wake_up_process(pgdat->kmigrated); + return 0; +} + +static void pghot_free_hot_map(struct mem_section *ms) +{ + kfree((void *)((unsigned long)ms->hot_map & ~PGHOT_SECTION_HOT_MASK)); + ms->hot_map =3D NULL; +} + +static int pghot_alloc_hot_map(struct mem_section *ms, int nid) +{ + ms->hot_map =3D kcalloc_node(PAGES_PER_SECTION, PGHOT_RECORD_SIZE, GFP_KE= RNEL, + nid); + if (!ms->hot_map) + return -ENOMEM; + return 0; +} + +static void pghot_offline_sec_hotmap(unsigned long start_pfn, + unsigned long nr_pages) +{ + unsigned long start, end, pfn; + struct mem_section *ms; + + start =3D SECTION_ALIGN_DOWN(start_pfn); + end =3D SECTION_ALIGN_UP(start_pfn + nr_pages); + + for (pfn =3D start; pfn < end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + continue; + + pghot_free_hot_map(ms); + } +} + +static int pghot_online_sec_hotmap(unsigned long start_pfn, + unsigned long nr_pages) +{ + int nid =3D pfn_to_nid(start_pfn); + unsigned long start, end, pfn; + struct mem_section *ms; + int fail =3D 0; + + start =3D SECTION_ALIGN_DOWN(start_pfn); + end =3D SECTION_ALIGN_UP(start_pfn + nr_pages); + + for (pfn =3D start; !fail && pfn 
< end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (!ms || ms->hot_map) + continue; + + fail =3D pghot_alloc_hot_map(ms, nid); + } + + if (!fail) + return 0; + + /* rollback */ + end =3D pfn - PAGES_PER_SECTION; + for (pfn =3D start; pfn < end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (ms && ms->hot_map) + pghot_free_hot_map(ms); + } + return -ENOMEM; +} + +static int pghot_memhp_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + struct memory_notify *mn =3D arg; + int ret =3D 0; + + switch (action) { + case MEM_GOING_ONLINE: + ret =3D pghot_online_sec_hotmap(mn->start_pfn, mn->nr_pages); + break; + case MEM_OFFLINE: + case MEM_CANCEL_ONLINE: + pghot_offline_sec_hotmap(mn->start_pfn, mn->nr_pages); + break; + } + + return notifier_from_errno(ret); +} + +static void pghot_destroy_hot_map(void) +{ + unsigned long section_nr, s_begin; + struct mem_section *ms; + + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + ms =3D __nr_to_section(section_nr); + pghot_free_hot_map(ms); + } +} + +static int pghot_setup_hot_map(void) +{ + unsigned long section_nr, s_begin, start_pfn; + struct mem_section *ms; + int nid; + + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + ms =3D __nr_to_section(section_nr); + start_pfn =3D section_nr_to_pfn(section_nr); + nid =3D pfn_to_nid(start_pfn); + + if (node_is_toptier(nid) || !pfn_valid(start_pfn)) + continue; + + if (pghot_alloc_hot_map(ms, nid)) + goto out_free_hot_map; + } + hotplug_memory_notifier(pghot_memhp_callback, DEFAULT_CALLBACK_PRI); + return 0; + +out_free_hot_map: + pghot_destroy_hot_map(); + return -ENOMEM; +} + +static int __init pghot_init(void) +{ + pg_data_t *pgdat; + int nid, ret; + + ret =3D pghot_setup_hot_map(); + if (ret) + return ret; + + for_each_node_state(nid, N_MEMORY) { + ret =3D kmigrated_run(nid); + if (ret) + goto out_stop_kthread; + } + 
+	register_sysctl_init("vm", pghot_sysctls);
+	pghot_debug_init();
+
+	kmigrated_started = true;
+	return 0;
+
+out_stop_kthread:
+	for_each_node_state(nid, N_MEMORY) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kmigrated) {
+			kthread_stop(pgdat->kmigrated);
+			pgdat->kmigrated = NULL;
+		}
+	}
+	pghot_destroy_hot_map();
+	return ret;
+}
+
+late_initcall_sync(pghot_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 86b14b0f77b5..d3fbe2a5d0e6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1486,6 +1486,11 @@ const char * const vmstat_text[] = {
 	[I(KSTACK_REST)]	= "kstack_rest",
 #endif
 #endif
+#ifdef CONFIG_PGHOT
+	[I(PGHOT_RECORDED_ACCESSES)]	= "pghot_recorded_accesses",
+	[I(PGHOT_RECORDED_HINTFAULTS)]	= "pghot_recorded_hintfaults",
+	[I(PGHOT_RECORDED_HWHINTS)]	= "pghot_recorded_hwhints",
+#endif /* CONFIG_PGHOT */
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 };
--
2.34.1
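[Editor's illustration] pghot_online_sec_hotmap() above follows the classic allocate-then-roll-back pattern: allocate a hot_map for every section in the range and, on the first allocation failure, free only the maps that were populated before the failing section. A minimal userspace C sketch of the same pattern (all names, sizes, and the failure-injection hook are hypothetical, not kernel API):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define NSECTIONS	8
#define RECORDS		64	/* stands in for PAGES_PER_SECTION records */

static void *maps[NSECTIONS];
static int fail_at = -1;	/* inject an allocation failure at this index */

static void *alloc_map(int i)
{
	return (i == fail_at) ? NULL : calloc(RECORDS, 1);
}

static int online_range(int start, int end)
{
	int i;

	/* allocate a map for every section in [start, end) */
	for (i = start; i < end; i++) {
		maps[i] = alloc_map(i);
		if (!maps[i])
			break;
	}
	if (i == end)
		return 0;

	/* roll back: free only the maps populated before the failure */
	while (--i >= start) {
		free(maps[i]);
		maps[i] = NULL;
	}
	return -ENOMEM;
}
```

With `fail_at = 6`, `online_range(4, 8)` returns -ENOMEM and leaves `maps[4..7]` all NULL, mirroring how the rollback loop in the patch walks back only over the sections it had populated.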
From nobody Thu Apr 2 14:10:30 2026
From: Bharata B Rao
Subject: [RFC PATCH v6 4/5] mm: pghot: Precision mode for pghot
Date: Mon, 23 Mar 2026 15:21:03 +0530
Message-ID: <20260323095104.238982-5-bharata@amd.com>
In-Reply-To: <20260323095104.238982-1-bharata@amd.com>
References: <20260323095104.238982-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Default pghot stores hotness in a 1-byte record per PFN, limiting
frequency to 2 bits, time to a 5-bit bucket, and preventing storage of
a per-PFN toptier NID. This restricts time granularity and forces all
promotions to use the global pghot_target_nid.
This patch adds an optional precision mode (CONFIG_PGHOT_PRECISE) that
expands the hotness record to 4 bytes (u32) and provides:

- a 10-bit NID field for a per-PFN promotion target,
- a 3-bit frequency field (freq_threshold range 1-7),
- a 14-bit time field offering finer recency tracking,
- an MSB migrate-ready bit.

Precision mode improves placement accuracy on systems with multiple
toptier nodes and provides higher-resolution hotness tracking, at the
cost of increasing metadata to 4 bytes per PFN. Documentation, tunables,
and the record layout are updated accordingly.

Signed-off-by: Bharata B Rao
---
 Documentation/admin-guide/mm/pghot.txt |  4 +-
 include/linux/mmzone.h                 |  2 +-
 include/linux/pghot.h                  | 31 ++++++++++
 mm/Kconfig                             | 11 ++++
 mm/Makefile                            |  7 ++-
 mm/pghot-precise.c                     | 81 ++++++++++++++++++++++++++
 mm/pghot.c                             | 13 +++--
 7 files changed, 141 insertions(+), 8 deletions(-)
 create mode 100644 mm/pghot-precise.c

diff --git a/Documentation/admin-guide/mm/pghot.txt b/Documentation/admin-guide/mm/pghot.txt
index 5f51dd1d4d45..7b84e911afe7 100644
--- a/Documentation/admin-guide/mm/pghot.txt
+++ b/Documentation/admin-guide/mm/pghot.txt
@@ -37,7 +37,7 @@ Path: /sys/kernel/debug/pghot/
 
 3. **freq_threshold**
    - Minimum access frequency before a page is marked ready for promotion.
-   - Range: 1 to 3
+   - Range: 1 to 3 in default mode, 1 to 7 in precision mode.
    - Default: 2
    - Example:
      # echo 3 > /sys/kernel/debug/pghot/freq_threshold
@@ -59,7 +59,7 @@ Path: /proc/sys/vm/pghot_promote_freq_window_ms
 - Controls the time window (in ms) for counting access frequency. A page
   is considered hot only when **freq_threshold** number of accesses occur
   with this time period.
-- Default: 3000 (3 seconds)
+- Default: 3000 (3 seconds) in default mode and 5000 (5s) in precision mode.
 - Example:
   # sysctl vm.pghot_promote_freq_window_ms=3000
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d7ed60956543..61fd259d9897 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1938,7 +1938,7 @@ struct mem_section {
 #ifdef CONFIG_PGHOT
 	/*
 	 * Per-PFN hotness data for this section.
-	 * Array of phi_t (u8 in default mode).
+	 * Array of phi_t (u8 in default mode, u32 in precision mode).
 	 * LSB is used as PGHOT_SECTION_HOT_BIT flag.
 	 */
 	void *hot_map;
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 525d4dd28fc1..2e1742b8caee 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -35,6 +35,36 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
 
 #define PGHOT_DEFAULT_NODE	0
 
+#if defined(CONFIG_PGHOT_PRECISE)
+#define PGHOT_DEFAULT_FREQ_WINDOW	(5 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-26 are used to store nid, frequency and time.
+ * Bits 27-30 are unused now.
+ * Bit 31 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY	31
+
+#define PGHOT_NID_WIDTH		10
+#define PGHOT_FREQ_WIDTH	3
+/* time is stored in 14 bits which can represent up to 16s with HZ=1000 */
+#define PGHOT_TIME_WIDTH	14
+
+#define PGHOT_NID_SHIFT		0
+#define PGHOT_FREQ_SHIFT	(PGHOT_NID_SHIFT + PGHOT_NID_WIDTH)
+#define PGHOT_TIME_SHIFT	(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_NID_MASK		GENMASK(PGHOT_NID_WIDTH - 1, 0)
+#define PGHOT_FREQ_MASK		GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK		GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+
+#define PGHOT_NID_MAX		((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX		((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX		((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u32 phi_t;
+
+#else /* !CONFIG_PGHOT_PRECISE */
 #define PGHOT_DEFAULT_FREQ_WINDOW	(3 * MSEC_PER_SEC)
 
 /*
@@ -61,6 +91,7 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
 #define PGHOT_TIME_MAX		((1 << PGHOT_TIME_WIDTH) - 1)
 
 typedef u8 phi_t;
+#endif /* CONFIG_PGHOT_PRECISE */
 
 #define PGHOT_RECORD_SIZE	sizeof(phi_t)
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 4aeab6aee535..14383bb1d890 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1485,6 +1485,17 @@ config PGHOT
 	  This adds 1 byte of metadata overhead per page in lower-tier
 	  memory nodes.
 
+config PGHOT_PRECISE
+	bool "Hot page tracking precision mode"
+	def_bool n
+	depends on PGHOT
+	help
+	  Enables precision mode for tracking hot pages with pghot sub-system.
+	  Adds fine-grained access time tracking and explicit toptier target
+	  NID tracking. Precise hot page tracking comes at the cost of using
+	  4 bytes per page against the default one byte per page. Preferable
+	  to enable this on systems with multiple nodes in toptier.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 33014de43acc..dc61f4d955f8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,4 +150,9 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
-obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o
+ifdef CONFIG_PGHOT_PRECISE
+obj-$(CONFIG_PGHOT) += pghot-precise.o
+else
+obj-$(CONFIG_PGHOT) += pghot-default.o
+endif
diff --git a/mm/pghot-precise.c b/mm/pghot-precise.c
new file mode 100644
index 000000000000..9e8007adfff9
--- /dev/null
+++ b/mm/pghot-precise.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Precision mode
+ *
+ * 4 byte hotness record per PFN (u32)
+ * NID, time and frequency tracked as part of the record.
+ */
+
+#include
+#include
+
+bool pghot_nid_valid(int nid)
+{
+	/*
+	 * TODO: Add node_online() and node_is_toptier() checks?
+	 */
+	if (nid != NUMA_NO_NODE && (nid < 0 || nid >= PGHOT_NID_MAX))
+		return false;
+
+	return true;
+}
+
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
+{
+	return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK);
+}
+
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
+{
+	phi_t freq, old_freq, hotness, old_hotness, old_time;
+	phi_t time = now & PGHOT_TIME_MASK;
+
+	nid = (nid == NUMA_NO_NODE) ? pghot_target_nid : nid;
+	old_hotness = READ_ONCE(*phi);
+
+	do {
+		bool new_window = false;
+
+		hotness = old_hotness;
+		old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+		old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+		if (pghot_access_latency(old_time, time) > sysctl_pghot_freq_window)
+			new_window = true;
+
+		if (new_window)
+			freq = 1;
+		else if (old_freq < PGHOT_FREQ_MAX)
+			freq = old_freq + 1;
+		else
+			freq = old_freq;
+
+		hotness &= ~(PGHOT_NID_MASK << PGHOT_NID_SHIFT);
+		hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+		hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+		hotness |= (nid & PGHOT_NID_MASK) << PGHOT_NID_SHIFT;
+		hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+		hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+		if (freq >= pghot_freq_threshold)
+			hotness |= BIT(PGHOT_MIGRATE_READY);
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+	return !!(hotness & BIT(PGHOT_MIGRATE_READY));
+}
+
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
+{
+	phi_t old_hotness, hotness = 0;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
+			return -EINVAL;
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+	*nid = (old_hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK;
+	*freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+	*time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+	return 0;
+}
diff --git a/mm/pghot.c b/mm/pghot.c
index dac9e6f3b61e..7d7ef0800ae2 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -10,6 +10,9 @@
  * the frequency of access and last access time. Promotions are done
  * to a default toptier NID.
  *
+ * In the precision mode, 4 bytes are used to store the frequency
+ * of access, last access time and the accessing NID.
+ *
  * A kernel thread named kmigrated is provided to migrate or promote
  * the hot pages. kmigrated runs for each lower tier node. It iterates
  * over the node's PFNs and migrates pages marked for migration into
@@ -52,13 +55,15 @@ static bool kmigrated_started __ro_after_init;
  * for the purpose of tracking page hotness and subsequent promotion.
  *
  * @pfn: PFN of the page
- * @nid: Unused
+ * @nid: Target NID to where the page needs to be migrated in precision
+ *	 mode but unused in default mode
  * @src: The identifier of the sub-system that reports the access
  * @now: Access time in jiffies
  *
- * Updates the frequency and time of access and marks the page as
- * ready for migration if the frequency crosses a threshold. The pages
- * marked for migration are migrated by kmigrated kernel thread.
+ * Updates the NID (in precision mode only), frequency and time of access
+ * and marks the page as ready for migration if the frequency crosses a
+ * threshold. The pages marked for migration are migrated by kmigrated
+ * kernel thread.
  *
  * Return: 0 on success and -EINVAL on failure to record the access.
 */
--
2.34.1

From nobody Thu Apr 2 14:10:30 2026
From: Bharata B Rao
Subject: [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot
Date: Mon, 23 Mar 2026 15:21:04 +0530
Message-ID: <20260323095104.238982-6-bharata@amd.com>
In-Reply-To: <20260323095104.238982-1-bharata@amd.com>
References: <20260323095104.238982-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Currently hot page promotion (the NUMA_BALANCING_MEMORY_TIERING mode
of NUMA Balancing) does hot page detection (via hint faults), hot page
classification and the eventual promotion all by itself, and sits
within the scheduler. Now that pghot, the new hot page tracking and
promotion mechanism, is available, NUMA Balancing can limit itself to
detecting hot pages (via hint faults) and off-load the rest of the
functionality to pghot. To achieve this, the
pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the hot page
info to pghot. In addition, the migration rate limiting and dynamic
threshold logic are moved to kmigrated so that they can also be used
for hot pages reported by other sources.

Hence it becomes necessary to introduce a new config option,
CONFIG_NUMA_BALANCING_TIERING, to control the hint fault source for
hot page promotion. This option controls the
NUMA_BALANCING_MEMORY_TIERING mode of kernel.numa_balancing.

This movement of hot page promotion to pghot results in the following
changes to the behaviour of hint-fault based hot page promotion:

1. Promotion is no longer done in the fault path; instead it is
deferred to kmigrated and happens in batches.

2. NUMA_BALANCING_MEMORY_TIERING mode used to promote on first access.
pghot by default promotes on second access, though this can be changed
by setting /sys/kernel/debug/pghot/freq_threshold.
The hot_threshold_ms debugfs tunable is now replaced by pghot's
freq_threshold.

3. In NUMA_BALANCING_MEMORY_TIERING mode, hint fault latency is the
difference between the PTE update time (during scanning) and the
access time (hint fault). With pghot, however, a single latency
threshold is used for two purposes:
a) If the time difference between successive accesses is within the
threshold, the page is marked as hot.
b) Later, when kmigrated picks up the page for migration, it migrates
the page only if the difference between the current time and the time
when the page was marked hot is within the threshold.

4. Batch migration of misplaced folios is done from non-process
context, where VMA info is not readily available. Without the VMA and
the exec check on it, it is not possible to filter out exec pages
during the migration prep stage. Hence shared executable pages will
also be subjected to misplaced-page migration.

5. The max scan period, which is used in the dynamic threshold logic,
was a debugfs tunable. It has been converted to a scalar metric in
pghot.

Key code changes due to this movement are detailed below to aid
understanding of the restructuring:

1. Scanning and access times are no longer tracked in the last_cpupid
field of folio flags. Hence all code related to this (like
folio_xchg_access_time(), cpupid_valid()) is removed.

2. The misplaced-page migration routines become conditional on
CONFIG_PGHOT in addition to CONFIG_NUMA_BALANCING.

3. The promotion-related stats (like PGPROMOTE_SUCCESS etc.) are now
moved under CONFIG_PGHOT, as these stats are part of the promotion
engine which will be used for other hotness sources as well.

4. Routines responsible for migration rate limiting, dynamic
thresholding, pgdat balancing during promotion etc. are moved to pghot
with appropriate renaming.
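[Editor's illustration] The dual use of the single latency threshold described in (3) above can be sketched in userspace C. The 3000 ms window mirrors the pghot default freq window; all function names here are hypothetical, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define FREQ_WINDOW_MS	3000	/* mirrors the default pghot freq window */

/* (3a): successive accesses within the window count toward hotness */
static bool accesses_within_window(unsigned long prev_ms, unsigned long now_ms)
{
	return (now_ms - prev_ms) <= FREQ_WINDOW_MS;
}

/* (3b): kmigrated migrates only while the hot mark is still fresh */
static bool hot_mark_still_fresh(unsigned long marked_ms, unsigned long now_ms)
{
	return (now_ms - marked_ms) <= FREQ_WINDOW_MS;
}
```

So an access 1.5 s after the previous one keeps the page in the same frequency window, while a hot mark that is more than 3 s old by the time kmigrated scans the page no longer qualifies it for migration.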
Signed-off-by: Bharata B Rao
---
 include/linux/mm.h     |  35 ++------
 include/linux/mmzone.h |   4 +-
 init/Kconfig           |  13 +++
 kernel/sched/core.c    |   7 ++
 kernel/sched/debug.c   |   1 -
 kernel/sched/fair.c    | 177 ++---------------------------------------
 kernel/sched/sched.h   |   1 -
 mm/huge_memory.c       |  27 ++++++-
 mm/memcontrol.c        |   6 +-
 mm/memory-tiers.c      |  15 ++--
 mm/memory.c            |  36 +++++++--
 mm/mempolicy.c         |   3 -
 mm/migrate.c           |  16 +++-
 mm/pghot.c             | 134 +++++++++++++++++++++++++++++++
 mm/vmstat.c            |   2 +-
 15 files changed, 248 insertions(+), 229 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index abb4963c1f06..81249a06dfeb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1998,17 +1998,6 @@ static inline int folio_nid(const struct folio *folio)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
-/* page access time bits needs to hold at least 4 seconds */
-#define PAGE_ACCESS_TIME_MIN_BITS	12
-#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
-#define PAGE_ACCESS_TIME_BUCKETS \
-	(PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT)
-#else
-#define PAGE_ACCESS_TIME_BUCKETS	0
-#endif
-
-#define PAGE_ACCESS_TIME_MASK \
-	(LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS)
 
 static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
@@ -2074,15 +2063,6 @@ static inline void page_cpupid_reset_last(struct page *page)
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
-static inline int folio_xchg_access_time(struct folio *folio, int time)
-{
-	int last_time;
-
-	last_time = folio_xchg_last_cpupid(folio,
-					   time >> PAGE_ACCESS_TIME_BUCKETS);
-	return last_time << PAGE_ACCESS_TIME_BUCKETS;
-}
-
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 	unsigned int pid_bit;
@@ -2093,18 +2073,12 @@ static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 	}
 }
 
-bool folio_use_access_time(struct folio *folio);
 #else /* !CONFIG_NUMA_BALANCING */
 static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid)
 {
 	return folio_nid(folio); /* XXX */
 }
 
-static inline int folio_xchg_access_time(struct folio *folio, int time) -{ - return 0; -} - static inline int folio_last_cpupid(struct folio *folio) { return folio_nid(folio); /* XXX */ @@ -2147,11 +2121,16 @@ static inline bool cpupid_match_pid(struct task_str= uct *task, int cpupid) static inline void vma_set_access_pid_bit(struct vm_area_struct *vma) { } -static inline bool folio_use_access_time(struct folio *folio) +#endif /* CONFIG_NUMA_BALANCING */ + +#ifdef CONFIG_NUMA_BALANCING_TIERING +bool folio_is_promo_candidate(struct folio *folio); +#else +static inline bool folio_is_promo_candidate(struct folio *folio) { return false; } -#endif /* CONFIG_NUMA_BALANCING */ +#endif /* CONFIG_NUMA_BALANCING_TIERING */ =20 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) =20 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 61fd259d9897..bfaaa757b19c 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -232,7 +232,7 @@ enum node_stat_item { #ifdef CONFIG_SWAP NR_SWAPCACHE, #endif -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT PGPROMOTE_SUCCESS, /* promote successfully */ /** * Candidate pages for promotion based on hint fault latency. This @@ -1475,7 +1475,7 @@ typedef struct pglist_data { struct deferred_split deferred_split_queue; #endif =20 -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT /* start time in ms of current promote rate limit period */ unsigned int nbp_rl_start; /* number of promote candidate pages at start time of current rate limit = period */ diff --git a/init/Kconfig b/init/Kconfig index 444ce811ea67..56ef148487fa 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1013,6 +1013,19 @@ config NUMA_BALANCING_DEFAULT_ENABLED If set, automatic NUMA balancing will be enabled if running on a NUMA machine. =20 +config NUMA_BALANCING_TIERING + bool "NUMA balancing memory tiering promotion" + depends on NUMA_BALANCING && PGHOT + help + Enable NUMA balancing mode 2 (memory tiering). 
This allows + automatic promotion of hot pages from slower memory tiers to + faster tiers using the pghot subsystem. + + This requires CONFIG_PGHOT for the hot page tracking engine. + This option is required for kernel.numa_balancing=3D2. + + If unsure, say N. + config SLAB_OBJ_EXT bool =20 diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 496dff740dca..f8ca5dff9cad 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4463,6 +4463,7 @@ void set_numabalancing_state(bool enabled) } =20 #ifdef CONFIG_PROC_SYSCTL +#ifdef CONFIG_NUMA_BALANCING_TIERING static void reset_memory_tiering(void) { struct pglist_data *pgdat; @@ -4473,6 +4474,7 @@ static void reset_memory_tiering(void) pgdat->nbp_th_start =3D jiffies_to_msecs(jiffies); } } +#endif =20 static int sysctl_numa_balancing(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -4490,9 +4492,14 @@ static int sysctl_numa_balancing(const struct ctl_ta= ble *table, int write, if (err < 0) return err; if (write) { + if ((state & NUMA_BALANCING_MEMORY_TIERING) && + !IS_ENABLED(CONFIG_NUMA_BALANCING_TIERING)) + return -EOPNOTSUPP; +#ifdef CONFIG_NUMA_BALANCING_TIERING if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && (state & NUMA_BALANCING_MEMORY_TIERING)) reset_memory_tiering(); +#endif sysctl_numa_balancing_mode =3D state; __set_numabalancing_state(state); } diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index b24f40f05019..c6a3325ebbd2 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -622,7 +622,6 @@ static __init int sched_init_debug(void) debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balanci= ng_scan_period_min); debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balanci= ng_scan_period_max); debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_sca= n_size); - debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing= _hot_threshold); #endif /* 
CONFIG_NUMA_BALANCING */ =20 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bf948db905ed..131fc4bb1fa7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu) static unsigned int sysctl_sched_cfs_bandwidth_slice =3D 5000UL; #endif =20 -#ifdef CONFIG_NUMA_BALANCING -/* Restrict the NUMA promotion throughput (MB/s) for each target node. */ -static unsigned int sysctl_numa_balancing_promote_rate_limit =3D 65536; -#endif - #ifdef CONFIG_SYSCTL static const struct ctl_table sched_fair_sysctls[] =3D { #ifdef CONFIG_CFS_BANDWIDTH @@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] =3D= { .extra1 =3D SYSCTL_ONE, }, #endif -#ifdef CONFIG_NUMA_BALANCING - { - .procname =3D "numa_balancing_promote_rate_limit_MBps", - .data =3D &sysctl_numa_balancing_promote_rate_limit, - .maxlen =3D sizeof(unsigned int), - .mode =3D 0644, - .proc_handler =3D proc_dointvec_minmax, - .extra1 =3D SYSCTL_ZERO, - }, -#endif /* CONFIG_NUMA_BALANCING */ }; =20 static int __init sched_fair_sysctl_init(void) @@ -1519,9 +1504,6 @@ unsigned int sysctl_numa_balancing_scan_size =3D 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in m= s */ unsigned int sysctl_numa_balancing_scan_delay =3D 1000; =20 -/* The page with hint page fault latency < threshold in ms is considered h= ot */ -unsigned int sysctl_numa_balancing_hot_threshold =3D MSEC_PER_SEC; - struct numa_group { refcount_t refcount; =20 @@ -1864,120 +1846,6 @@ static inline unsigned long group_weight(struct tas= k_struct *p, int nid, return 1000 * faults / total_faults; } =20 -/* - * If memory tiering mode is enabled, cpupid of slow memory page is - * used to record scan time instead of CPU and PID. When tiering mode - * is disabled at run time, the scan time (in cpupid) will be - * interpreted as CPU and PID. 
So CPU needs to be checked to avoid to - * access out of array bound. - */ -static inline bool cpupid_valid(int cpupid) -{ - return cpupid_to_cpu(cpupid) < nr_cpu_ids; -} - -/* - * For memory tiering mode, if there are enough free pages (more than - * enough watermark defined here) in fast memory node, to take full - * advantage of fast memory capacity, all recently accessed slow - * memory pages will be migrated to fast memory node without - * considering hot threshold. - */ -static bool pgdat_free_space_enough(struct pglist_data *pgdat) -{ - int z; - unsigned long enough_wmark; - - enough_wmark =3D max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, - pgdat->node_present_pages >> 4); - for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { - struct zone *zone =3D pgdat->node_zones + z; - - if (!populated_zone(zone)) - continue; - - if (zone_watermark_ok(zone, 0, - promo_wmark_pages(zone) + enough_wmark, - ZONE_MOVABLE, 0)) - return true; - } - return false; -} - -/* - * For memory tiering mode, when page tables are scanned, the scan - * time will be recorded in struct page in addition to make page - * PROT_NONE for slow memory page. So when the page is accessed, in - * hint page fault handler, the hint page fault latency is calculated - * via, - * - * hint page fault latency =3D hint page fault time - scan time - * - * The smaller the hint page fault latency, the higher the possibility - * for the page to be hot. - */ -static int numa_hint_fault_latency(struct folio *folio) -{ - int last_time, time; - - time =3D jiffies_to_msecs(jiffies); - last_time =3D folio_xchg_access_time(folio, time); - - return (time - last_time) & PAGE_ACCESS_TIME_MASK; -} - -/* - * For memory tiering mode, too high promotion/demotion throughput may - * hurt application latency. So we provide a mechanism to rate limit - * the number of pages that are tried to be promoted. 
- */ -static bool numa_promotion_rate_limit(struct pglist_data *pgdat, - unsigned long rate_limit, int nr) -{ - unsigned long nr_cand; - unsigned int now, start; - - now =3D jiffies_to_msecs(jiffies); - mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr); - nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); - start =3D pgdat->nbp_rl_start; - if (now - start > MSEC_PER_SEC && - cmpxchg(&pgdat->nbp_rl_start, start, now) =3D=3D start) - pgdat->nbp_rl_nr_cand =3D nr_cand; - if (nr_cand - pgdat->nbp_rl_nr_cand >=3D rate_limit) - return true; - return false; -} - -#define NUMA_MIGRATION_ADJUST_STEPS 16 - -static void numa_promotion_adjust_threshold(struct pglist_data *pgdat, - unsigned long rate_limit, - unsigned int ref_th) -{ - unsigned int now, start, th_period, unit_th, th; - unsigned long nr_cand, ref_cand, diff_cand; - - now =3D jiffies_to_msecs(jiffies); - th_period =3D sysctl_numa_balancing_scan_period_max; - start =3D pgdat->nbp_th_start; - if (now - start > th_period && - cmpxchg(&pgdat->nbp_th_start, start, now) =3D=3D start) { - ref_cand =3D rate_limit * - sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC; - nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); - diff_cand =3D nr_cand - pgdat->nbp_th_nr_cand; - unit_th =3D ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS; - th =3D pgdat->nbp_threshold ? : ref_th; - if (diff_cand > ref_cand * 11 / 10) - th =3D max(th - unit_th, unit_th); - else if (diff_cand < ref_cand * 9 / 10) - th =3D min(th + unit_th, ref_th * 2); - pgdat->nbp_th_nr_cand =3D nr_cand; - pgdat->nbp_threshold =3D th; - } -} - bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, int src_nid, int dst_cpu) { @@ -1993,41 +1861,15 @@ bool should_numa_migrate_memory(struct task_struct = *p, struct folio *folio, =20 /* * The pages in slow memory node should be migrated according - * to hot/cold instead of private/shared. 
- */ - if (folio_use_access_time(folio)) { - struct pglist_data *pgdat; - unsigned long rate_limit; - unsigned int latency, th, def_th; - long nr =3D folio_nr_pages(folio); - - pgdat =3D NODE_DATA(dst_nid); - if (pgdat_free_space_enough(pgdat)) { - /* workload changed, reset hot threshold */ - pgdat->nbp_threshold =3D 0; - mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr); - return true; - } - - def_th =3D sysctl_numa_balancing_hot_threshold; - rate_limit =3D MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit); - numa_promotion_adjust_threshold(pgdat, rate_limit, def_th); - - th =3D pgdat->nbp_threshold ? : def_th; - latency =3D numa_hint_fault_latency(folio); - if (latency >=3D th) - return false; - - return !numa_promotion_rate_limit(pgdat, rate_limit, nr); - } + * to hot/cold instead of private/shared. Also the migration + * of such pages are handled by kmigrated. + */ + if (folio_is_promo_candidate(folio)) + return true; =20 this_cpupid =3D cpu_pid_to_cpupid(dst_cpu, current->pid); last_cpupid =3D folio_xchg_last_cpupid(folio, this_cpupid); =20 - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && - !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid)) - return false; - /* * Allow first faults or private faults to migrate immediately early in * the lifetime of a task. The magic number 4 is based on waiting for @@ -3237,15 +3079,6 @@ void task_numa_fault(int last_cpupid, int mem_node, = int pages, int flags) if (!p->mm) return; =20 - /* - * NUMA faults statistics are unnecessary for the slow memory - * node for memory tiering mode. 
- */ - if (!node_is_toptier(mem_node) && - (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING || - !cpupid_valid(last_cpupid))) - return; - /* Allocate buffer to track faults on a per-node basis */ if (unlikely(!p->numa_faults)) { int size =3D sizeof(*p->numa_faults) * diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 43bbf0693cca..a47f7e3d51a6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3021,7 +3021,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; -extern unsigned int sysctl_numa_balancing_hot_threshold; =20 #ifdef CONFIG_SCHED_HRTICK =20 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b298cba853ab..fe957ff91df9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -40,6 +40,7 @@ #include #include #include +#include =20 #include #include "internal.h" @@ -2190,7 +2191,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) int nid =3D NUMA_NO_NODE; int target_nid, last_cpupid; pmd_t pmd, old_pmd; - bool writable =3D false; + bool writable =3D false, needs_promotion =3D false; int flags =3D 0; =20 vmf->ptl =3D pmd_lock(vma->vm_mm, vmf->pmd); @@ -2217,11 +2218,26 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *v= mf) goto out_map; =20 nid =3D folio_nid(folio); + needs_promotion =3D folio_is_promo_candidate(folio); =20 target_nid =3D numa_migrate_check(folio, vmf, haddr, &flags, writable, &last_cpupid); if (target_nid =3D=3D NUMA_NO_NODE) goto out_map; + + if (needs_promotion) { + /* + * Hot page promotion, mode=3DNUMA_BALANCING_MEMORY_TIERING. + * Isolation and migration are handled by pghot. 
+ * + * TODO: mode2 check + */ + writable =3D false; + nid =3D target_nid; + goto out_map; + } + + /* Balancing b/n toptier nodes, mode=3DNUMA_BALANCING_NORMAL */ if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { flags |=3D TNF_MIGRATE_FAIL; goto out_map; @@ -2253,8 +2269,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vm= f) update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); spin_unlock(vmf->ptl); =20 - if (nid !=3D NUMA_NO_NODE) - task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags); + if (nid !=3D NUMA_NO_NODE) { + if (needs_promotion) + pghot_record_access(folio_pfn(folio), nid, + PGHOT_HINTFAULTS, jiffies); + else + task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags); + } return 0; } =20 diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 772bac21d155..fcd92f2ffd0c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -323,7 +323,7 @@ static const unsigned int memcg_node_stat_items[] =3D { #ifdef CONFIG_SWAP NR_SWAPCACHE, #endif -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT PGPROMOTE_SUCCESS, #endif PGDEMOTE_KSWAPD, @@ -1400,7 +1400,7 @@ static const struct memory_stat memory_stats[] =3D { { "pgdemote_direct", PGDEMOTE_DIRECT }, { "pgdemote_khugepaged", PGDEMOTE_KHUGEPAGED }, { "pgdemote_proactive", PGDEMOTE_PROACTIVE }, -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT { "pgpromote_success", PGPROMOTE_SUCCESS }, #endif }; @@ -1443,7 +1443,7 @@ static int memcg_page_state_output_unit(int item) case PGDEMOTE_DIRECT: case PGDEMOTE_KHUGEPAGED: case PGDEMOTE_PROACTIVE: -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT case PGPROMOTE_SUCCESS: #endif return 1; diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 986f809376eb..7303dc10035c 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -51,18 +51,19 @@ static const struct bus_type memory_tier_subsys =3D { .dev_name =3D "memory_tier", }; =20 -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_NUMA_BALANCING_TIERING /** - * folio_use_access_time - check if a folio reuses cpupid 
for page access = time + * folio_is_promo_candidate - check if the folio qualifies for promotion + * * @folio: folio to check * - * folio's _last_cpupid field is repurposed by memory tiering. In memory - * tiering mode, cpupid of slow memory folio (not toptier memory) is used = to - * record page access time. + * Checks if NUMA Balancing tiering mode is set and the folio belongs + * to lower tier. If so, it qualifies for promotion to toptier when + * it is categorized as hot. * - * Return: the folio _last_cpupid is used to record page access time + * Return: True if the above condition is met, else False. */ -bool folio_use_access_time(struct folio *folio) +bool folio_is_promo_candidate(struct folio *folio) { return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && !node_is_toptier(folio_nid(folio)); diff --git a/mm/memory.c b/mm/memory.c index 2f815a34d924..289fa6c07a42 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -75,6 +75,7 @@ #include #include #include +#include #include #include #include @@ -5968,10 +5969,9 @@ int numa_migrate_check(struct folio *folio, struct v= m_fault *vmf, if (folio_maybe_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) *flags |=3D TNF_SHARED; /* - * For memory tiering mode, cpupid of slow memory page is used - * to record page access time. So use default value. + * For memory tiering mode, last_cpupid is unused. So use default value. 
*/ - if (folio_use_access_time(folio)) + if (folio_is_promo_candidate(folio)) *last_cpupid =3D (-1 & LAST_CPUPID_MASK); else *last_cpupid =3D folio_last_cpupid(folio); @@ -6052,6 +6052,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) int nid =3D NUMA_NO_NODE; bool writable =3D false, ignore_writable =3D false; bool pte_write_upgrade =3D vma_wants_manual_pte_write_upgrade(vma); + bool needs_promotion =3D false; int last_cpupid; int target_nid; pte_t pte, old_pte; @@ -6086,16 +6087,31 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) goto out_map; =20 nid =3D folio_nid(folio); + needs_promotion =3D folio_is_promo_candidate(folio); nr_pages =3D folio_nr_pages(folio); =20 target_nid =3D numa_migrate_check(folio, vmf, vmf->address, &flags, writable, &last_cpupid); if (target_nid =3D=3D NUMA_NO_NODE) goto out_map; - if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { + + if (needs_promotion) { + /* + * Hot page promotion, mode=3DNUMA_BALANCING_MEMORY_TIERING. + * Isolation and migration are handled by pghot. + */ + writable =3D false; + ignore_writable =3D true; + nid =3D target_nid; + goto out_map; + } + + /* Balancing b/n toptier nodes, mode=3DNUMA_BALANCING_NORMAL */ + if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) { flags |=3D TNF_MIGRATE_FAIL; goto out_map; } + /* The folio is isolated and isolation code holds a folio reference. 
*/ pte_unmap_unlock(vmf->pte, vmf->ptl); writable =3D false; @@ -6110,7 +6126,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) } =20 flags |=3D TNF_MIGRATE_FAIL; - vmf->pte =3D pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->pte =3D pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); if (unlikely(!vmf->pte)) return 0; @@ -6118,6 +6134,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) pte_unmap_unlock(vmf->pte, vmf->ptl); return 0; } + out_map: /* * Make it present again, depending on how arch implements @@ -6131,8 +6148,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) writable); pte_unmap_unlock(vmf->pte, vmf->ptl); =20 - if (nid !=3D NUMA_NO_NODE) - task_numa_fault(last_cpupid, nid, nr_pages, flags); + if (nid !=3D NUMA_NO_NODE) { + if (needs_promotion) + pghot_record_access(folio_pfn(folio), nid, + PGHOT_HINTFAULTS, jiffies); + else + task_numa_fault(last_cpupid, nid, nr_pages, flags); + } return 0; } =20 diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0e5175f1c767..6eed217a5917 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -866,9 +866,6 @@ bool folio_can_map_prot_numa(struct folio *folio, struc= t vm_area_struct *vma, node_is_toptier(nid)) return false; =20 - if (folio_use_access_time(folio)) - folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); - return true; } =20 diff --git a/mm/migrate.c b/mm/migrate.c index a5f48984ed3e..db6832b4b95b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2690,8 +2690,18 @@ int migrate_misplaced_folio_prepare(struct folio *fo= lio, if (!migrate_balanced_pgdat(pgdat, nr_pages)) { int z; =20 - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)) + /* + * Kswapd wakeup for creating headroom in toptier is done only + * for hot page promotion case and not for misplaced migrations + * between toptier nodes. + * + * In the uncommon case of using NUMA_BALANCING_NORMAL mode + * to balance between lower and higher tier nodes, we end up + * up waking up kswapd. 
+ */ + if (node_is_toptier(folio_nid(folio))) return -EAGAIN; + for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { if (managed_zone(pgdat->node_zones + z)) break; @@ -2741,6 +2751,8 @@ int migrate_misplaced_folio(struct folio *folio, int = node) #ifdef CONFIG_NUMA_BALANCING count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); +#endif +#ifdef CONFIG_NUMA_BALANCING_TIERING if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && !node_is_toptier(folio_nid(folio)) && node_is_toptier(node)) { @@ -2796,6 +2808,8 @@ int migrate_misplaced_folios_batch(struct list_head *= folio_list, int node) #ifdef CONFIG_NUMA_BALANCING count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); +#endif +#ifdef CONFIG_PGHOT mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded); #endif } diff --git a/mm/pghot.c b/mm/pghot.c index 7d7ef0800ae2..3c0ba254ad4c 100644 --- a/mm/pghot.c +++ b/mm/pghot.c @@ -17,6 +17,9 @@ * the hot pages. kmigrated runs for each lower tier node. It iterates * over the node's PFNs and migrates pages marked for migration into * their targeted nodes. + * + * Migration rate-limiting and dynamic threshold logic implementations + * were moved from NUMA Balancing mode 2. */ #include #include @@ -32,6 +35,12 @@ unsigned int kmigrated_batch_nr =3D KMIGRATED_DEFAULT_BA= TCH_NR; =20 unsigned int sysctl_pghot_freq_window =3D PGHOT_DEFAULT_FREQ_WINDOW; =20 +/* Restrict the NUMA promotion throughput (MB/s) for each target node. 
*/ +static unsigned int sysctl_pghot_promote_rate_limit =3D 65536; + +#define KMIGRATED_MIGRATION_ADJUST_STEPS 16 +#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW 60000 + DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints); DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults); =20 @@ -45,6 +54,22 @@ static const struct ctl_table pghot_sysctls[] =3D { .proc_handler =3D proc_dointvec_minmax, .extra1 =3D SYSCTL_ZERO, }, + { + .procname =3D "pghot_promote_rate_limit_MBps", + .data =3D &sysctl_pghot_promote_rate_limit, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, + { + .procname =3D "numa_balancing_promote_rate_limit_MBps", + .data =3D &sysctl_pghot_promote_rate_limit, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, }; #endif =20 @@ -141,6 +166,110 @@ int pghot_record_access(unsigned long pfn, int nid, i= nt src, unsigned long now) return 0; } =20 +/* + * For memory tiering mode, if there are enough free pages (more than + * enough watermark defined here) in fast memory node, to take full + * advantage of fast memory capacity, all recently accessed slow + * memory pages will be migrated to fast memory node without + * considering hot threshold. + */ +static bool pgdat_free_space_enough(struct pglist_data *pgdat) +{ + int z; + unsigned long enough_wmark; + + enough_wmark =3D max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, + pgdat->node_present_pages >> 4); + for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { + struct zone *zone =3D pgdat->node_zones + z; + + if (!populated_zone(zone)) + continue; + + if (zone_watermark_ok(zone, 0, + promo_wmark_pages(zone) + enough_wmark, + ZONE_MOVABLE, 0)) + return true; + } + return false; +} + +/* + * For memory tiering mode, too high promotion/demotion throughput may + * hurt application latency. So we provide a mechanism to rate limit + * the number of pages that are tried to be promoted. 
+ */ +static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsi= gned long rate_limit, + int nr, unsigned long now_ms) +{ + unsigned long nr_cand; + unsigned int start; + + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr); + nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); + start =3D pgdat->nbp_rl_start; + if (now_ms - start > MSEC_PER_SEC && + cmpxchg(&pgdat->nbp_rl_start, start, now_ms) =3D=3D start) + pgdat->nbp_rl_nr_cand =3D nr_cand; + if (nr_cand - pgdat->nbp_rl_nr_cand >=3D rate_limit) + return true; + return false; +} + +static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat, + unsigned long rate_limit, unsigned int ref_th, + unsigned long now_ms) +{ + unsigned int start, th_period, unit_th, th; + unsigned long nr_cand, ref_cand, diff_cand; + + th_period =3D KMIGRATED_PROMOTION_THRESHOLD_WINDOW; + start =3D pgdat->nbp_th_start; + if (now_ms - start > th_period && + cmpxchg(&pgdat->nbp_th_start, start, now_ms) =3D=3D start) { + ref_cand =3D rate_limit * + KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC; + nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); + diff_cand =3D nr_cand - pgdat->nbp_th_nr_cand; + unit_th =3D ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS; + th =3D pgdat->nbp_threshold ? 
: ref_th; + if (diff_cand > ref_cand * 11 / 10) + th =3D max(th - unit_th, unit_th); + else if (diff_cand < ref_cand * 9 / 10) + th =3D min(th + unit_th, ref_th * 2); + pgdat->nbp_th_nr_cand =3D nr_cand; + pgdat->nbp_threshold =3D th; + } +} + +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int ni= d, + unsigned long time) +{ + struct pglist_data *pgdat; + unsigned long rate_limit; + unsigned int th, def_th; + unsigned long now_ms =3D jiffies_to_msecs(jiffies); /* Based on full-widt= h jiffies */ + unsigned long now =3D jiffies; + + pgdat =3D NODE_DATA(nid); + if (pgdat_free_space_enough(pgdat)) { + /* workload changed, reset hot threshold */ + pgdat->nbp_threshold =3D 0; + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages); + return true; + } + + def_th =3D sysctl_pghot_freq_window; + rate_limit =3D MB_TO_PAGES(sysctl_pghot_promote_rate_limit); + kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms); + + th =3D pgdat->nbp_threshold ? : def_th; + if (pghot_access_latency(time, now) >=3D th) + return false; + + return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_m= s); +} + static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq, unsigned long *time) { @@ -218,6 +347,11 @@ static void kmigrated_walk_zone(unsigned long start_pf= n, unsigned long end_pfn, goto out_next; } =20 + if (!kmigrated_should_migrate_memory(nr, nid, time)) { + folio_put(folio); + goto out_next; + } + if (migrate_misplaced_folio_prepare(folio, NULL, nid)) { folio_put(folio); goto out_next; diff --git a/mm/vmstat.c b/mm/vmstat.c index d3fbe2a5d0e6..f28f786f8931 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1267,7 +1267,7 @@ const char * const vmstat_text[] =3D { #ifdef CONFIG_SWAP [I(NR_SWAPCACHE)] =3D "nr_swapcached", #endif -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT [I(PGPROMOTE_SUCCESS)] =3D "pgpromote_success", [I(PGPROMOTE_CANDIDATE)] =3D "pgpromote_candidate", [I(PGPROMOTE_CANDIDATE_NRL)] =3D 
"pgpromote_candidate_nrl", --=20 2.34.1