From nobody Fri Jun 12 12:43:35 2026
From: Bharata B Rao
Subject: [PATCH v7 1/7] mm: migrate: Allow misplaced migration without VMA
Date: Mon, 4 May 2026 11:39:18 +0530
Message-ID: <20260504060924.344313-2-bharata@amd.com>
In-Reply-To: <20260504060924.344313-1-bharata@amd.com>
References: <20260504060924.344313-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

We want isolation of misplaced folios to work in contexts where VMA
isn't available, typically when performing migrations from a kernel
thread context. In order to prepare for that, allow
migrate_misplaced_folio_prepare() to be called with a NULL VMA.

When migrate_misplaced_folio_prepare() is called with non-NULL VMA, it
will check if the folio is mapped shared and that requires holding PTL
lock. This path isn't taken when the function is invoked with NULL VMA
(migration outside of process context). Therefore, when VMA == NULL,
migrate_misplaced_folio_prepare() does not require the caller to hold
the PTL.

Signed-off-by: Bharata B Rao
---
 mm/migrate.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8a64291ab5b4..eb21a02fade0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2671,7 +2671,12 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 
 /*
  * Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
+ *
+ * When called with a NULL vma (e.g., kernel thread initiated migration),
+ * migrate_misplaced_folio_prepare() will allow shared executable folios
+ * to be migrated.
  */
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -2688,7 +2693,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		 * See folio_maybe_mapped_shared() on possible imprecision
 		 * when we cannot easily detect if a folio is shared.
 		 */
-		if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+		if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
 			return -EACCES;
 
 		/*
-- 
2.34.1

From nobody Fri Jun 12 12:43:35 2026
From: Bharata B Rao
Subject: [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios()
Date: Mon, 4 May 2026 11:39:19 +0530
Message-ID: <20260504060924.344313-3-bharata@amd.com>
In-Reply-To: <20260504060924.344313-1-bharata@amd.com>
References: <20260504060924.344313-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Gregory Price

Tiered memory systems often require migrating multiple folios at once.
Currently, migrate_misplaced_folio() handles only one folio per call,
which is inefficient for batch operations. This patch introduces
promote_misplaced_memcg_folios(), a batch variant that leverages
migrate_pages() internally for improved performance.

The caller must isolate folios beforehand using
migrate_misplaced_folio_prepare(). Additionally all the folios in the
isolated list must belong to the same memcg. On return, the folio list
will be empty regardless of success or failure.

This function will be used by pghot kmigrated thread.

Signed-off-by: Gregory Price
[Rewrote commit description, memcg awareness]
Signed-off-by: Bharata B Rao
---
 include/linux/migrate.h |  5 ++++
 mm/migrate.c            | 57 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d5af2b7f577b..d136612eef9d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -111,6 +111,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int promote_misplaced_memcg_folios(struct list_head *folio_list, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -121,6 +122,10 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index eb21a02fade0..747277aadf19 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2770,4 +2770,61 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/**
+ * promote_misplaced_memcg_folios() - Batch variant of migrate_misplaced_folio
+ * Attempts to promote a folio list to the specified destination.
+ * @folio_list: Isolated list of folios to be batch-promoted.
+ * @node: The NUMA node ID to where the folios should be promoted.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folios. All the isolated folios
+ * in the list must belong to the same memcg so that NUMA_PAGE_MIGRATE
+ * stat can be attributed correctly to the memcg.
+ *
+ * This function will un-isolate the folios, drop the elevated reference
+ * and remove them from the list before returning. This should be called
+ * only for batched promotion of hot pages from lower tier nodes.
+ *
+ * Return: 0 on success and -EAGAIN on failure or partial promotion.
+ *         On return, @folio_list will be empty regardless of success/failure.
+ */
+int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_succeeded = 0;
+	struct folio *first;
+	int nr_remaining;
+
+	if (list_empty(folio_list))
+		return 0;
+
+	first = list_first_entry(folio_list, struct folio, lru);
+#ifdef CONFIG_DEBUG_VM
+	{
+		struct folio *f;
+		list_for_each_entry(f, folio_list, lru)
+			VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first));
+	}
+#endif
+	memcg = get_mem_cgroup_from_folio(first);
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
+				 PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+
+	mem_cgroup_put(memcg);
+	WARN_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
-- 
2.34.1

From nobody Fri Jun 12 12:43:35 2026
From: Bharata B Rao
Subject: [PATCH v7 3/7] mm: Hot page tracking and promotion - pghot
Date: Mon, 4 May 2026 11:39:20 +0530
Message-ID: <20260504060924.344313-4-bharata@amd.com>
In-Reply-To: <20260504060924.344313-1-bharata@amd.com>
References: <20260504060924.344313-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

pghot is a subsystem that collects memory access information from
multiple sources, classifies hot pages resident in lower-tier memory,
and promotes them to faster tiers. It stores per-PFN hotness metadata
and performs asynchronous, batched promotion via a per-lower-tier-node
kernel thread (kmigrated).

This change introduces the default (compact) mode of pghot:

- Per-PFN hotness record (phi_t = u8) embedded via mem_section:
  - 2 bits: access frequency (4 levels)
  - 5 bits: time bucket (≈4s window with HZ=1000, bucketed jiffies)
  - 1 bit : migration-ready flag (MSB)
  The LSB of mem_section->hot_map pointer is used as a per-section
  "hot" flag to gate scanning.

- Event recording API:
  int pghot_record_access(unsigned long pfn, int nid, int src,
                          unsigned long now)
  @pfn: The PFN of the memory accessed
  @nid: The accessing NUMA node ID
  @src: The temperature source (subsystem) that generated the access info
  @now: The access time in jiffies
  - Sources (e.g., NUMA hint faults, HW hints) call this to report
    accesses.
  - In default mode, the nid is not stored/used for targeting; promotion
    goes to a configurable toptier node (pghot_target_nid).

- Promotion engine:
  - One kmigrated thread per lower-tier node.
  - Scans only sections whose "hot" flag was raised, iterates PFNs, and
    batches candidates by destination node.
  - Uses promote_misplaced_memcg_folios() to move batched folios.


- Tunables & stats:
  - debugfs: enabled_sources, target_nid, freq_threshold,
             kmigrated_sleep_ms, kmigrated_batch_nr
  - sysctl : vm.pghot_promote_freq_window_ms
  - vmstat : pghot_recorded_accesses, pghot_recorded_hintfaults,
             pghot_recorded_hwhints

Memory overhead
---------------
Default mode uses 1 byte of hotness metadata per PFN on lower-tier
nodes.

Behavior & policy
-----------------
- Default mode promotion target:
  The nid passed by sources is not stored; hot pages promote to
  pghot_target_nid (toptier). Precision mode (added later in the
  series) changes this.

- Record consumption:
  kmigrated consumes (clears) the "migration-ready" bit before
  attempting isolation. Additionally the hotness record is reset.
  If isolation/migration fails, the folio is not re-queued
  automatically; subsequent accesses will re-arm it. This avoids
  retry storms and keeps batching stable.

- Wakeups:
  kmigrated wakeups are intentionally timeout-driven. We set the
  per-pgdat "activate" flag on access, and kmigrated checks this flag
  on its next sleep interval. This keeps the first cut simple and
  avoids potential wake storms; active wakeups can be considered in a
  follow-up.
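
For illustration only, the compact record layout and update rule described
above can be sketched in user-space C. This is a simplified, non-atomic
sketch and not the code from this patch: the names record_update() and
bucket_latency() are hypothetical, HZ=1000 is assumed so that one jiffy is
one millisecond, and the kernel version updates the record with
try_cmpxchg() rather than a plain store.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout from the description: bits 0-1 frequency, bits 2-6 time bucket,
 * bit 7 migration-ready. One bucket covers 128 jiffies (shift of 7). */
#define FREQ_WIDTH   2
#define TIME_WIDTH   5
#define FREQ_SHIFT   0
#define TIME_SHIFT   (FREQ_SHIFT + FREQ_WIDTH)
#define FREQ_MASK    ((1u << FREQ_WIDTH) - 1)
#define TIME_MASK    ((1u << TIME_WIDTH) - 1)
#define FREQ_MAX     FREQ_MASK
#define READY_BIT    7
#define BUCKET_SHIFT 7

/* Hypothetical helper: gap between the stored bucket and "now", in jiffies.
 * Assumption: HZ=1000, so the result can be compared to a window in ms.
 * The subtraction wraps modulo 32 buckets (4096 jiffies). */
static unsigned long bucket_latency(uint8_t old_bucket, unsigned long now)
{
	unsigned long mask = (unsigned long)TIME_MASK << BUCKET_SHIFT;
	unsigned long new_time = now & mask;
	unsigned long old_time = (unsigned long)old_bucket << BUCKET_SHIFT;

	return (new_time - old_time) & mask;
}

/* Hypothetical, non-atomic update of one record.
 * Returns true when the record is marked migration-ready. */
static bool record_update(uint8_t *rec, unsigned long now,
			  unsigned int threshold, unsigned long window_ms)
{
	uint8_t old_freq = (*rec >> FREQ_SHIFT) & FREQ_MASK;
	uint8_t old_bucket = (*rec >> TIME_SHIFT) & TIME_MASK;
	uint8_t bucket = (now >> BUCKET_SHIFT) & TIME_MASK;
	uint8_t freq;

	if (bucket_latency(old_bucket, now) > window_ms)
		freq = 1;			/* window expired: restart the count */
	else if (old_freq < FREQ_MAX)
		freq = old_freq + 1;		/* same window: count the access */
	else
		freq = old_freq;		/* saturate at the 2-bit maximum */

	/* Keep an already-set ready bit, replace frequency and time fields. */
	*rec = (uint8_t)((*rec & (1u << READY_BIT)) |
			 (freq << FREQ_SHIFT) | (bucket << TIME_SHIFT));
	if (freq >= threshold)
		*rec |= (uint8_t)(1u << READY_BIT);

	return (*rec & (1u << READY_BIT)) != 0;
}
```

With a threshold of 2 and a 3000 ms window, a first call on a zeroed record
at jiffy 1000 stores frequency 1 and a second call at jiffy 1200 sets the
ready bit. Because only 5 bits of bucketed time are kept, the largest gap
the record can represent is 31 * 128 = 3968 jiffies (about 3.9 s at HZ=1000),
and gaps of 4096 jiffies or more alias onto smaller values.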

Signed-off-by: Bharata B Rao
---
 Documentation/admin-guide/mm/index.rst |   1 +
 Documentation/admin-guide/mm/pghot.rst |  80 ++++
 include/linux/migrate.h                |   4 +-
 include/linux/mmzone.h                 |  20 +
 include/linux/pghot.h                  |  82 ++++
 include/linux/vm_event_item.h          |   5 +
 mm/Kconfig                             |  14 +
 mm/Makefile                            |   1 +
 mm/migrate.c                           |  16 +-
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  79 ++++
 mm/pghot-tunables.c                    | 182 +++++++++
 mm/pghot.c                             | 494 +++++++++++++++++++++++++
 mm/vmstat.c                            |   5 +
 14 files changed, 986 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/pghot.rst
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index bbb563cba5d2..4d6810b02365 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -43,3 +43,4 @@ the Linux memory management.
    userfaultfd
    zswap
    kho
+   pghot
diff --git a/Documentation/admin-guide/mm/pghot.rst b/Documentation/admin-guide/mm/pghot.rst
new file mode 100644
index 000000000000..5f51dd1d4d45
--- /dev/null
+++ b/Documentation/admin-guide/mm/pghot.rst
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+PGHOT: Hot Page Tracking Tunables
+=================================
+
+Overview
+========
+The PGHOT subsystem tracks frequently accessed pages in lower-tier memory and
+promotes them to faster tiers. It uses per-PFN hotness metadata and asynchronous
+migration via per-node kernel threads (kmigrated).
+
+This document describes tunables available via **debugfs** and **sysctl** for
+PGHOT.
+
+Debugfs Interface
+=================
+Path: /sys/kernel/debug/pghot/
+
+1. **enabled_sources**
+   - Bitmask to enable/disable hotness sources.
+   - Bits:
+     - 0: Hint faults (value 0x1)
+     - 1: Hardware hints (value 0x2)
+   - Default: 0 (disabled)
+   - Example:
+     # echo 0x3 > /sys/kernel/debug/pghot/enabled_sources
+     Enables all sources.
+
+2. **target_nid**
+   - Toptier NUMA node ID to which hot pages are promoted. Used when the
+     hotness source can't provide the accessing NID or when the tracking
+     mode is default.
+   - Default: 0
+   - Example:
+     # echo 1 > /sys/kernel/debug/pghot/target_nid
+
+3. **freq_threshold**
+   - Minimum access frequency before a page is marked ready for promotion.
+   - Range: 1 to 3
+   - Default: 2
+   - Example:
+     # echo 3 > /sys/kernel/debug/pghot/freq_threshold
+
+4. **kmigrated_sleep_ms**
+   - Sleep interval (ms) for kmigrated thread between scans.
+   - Default: 100
+
+5. **kmigrated_batch_nr**
+   - Maximum number of folios migrated in one batch.
+   - Default: 512
+
+Sysctl Interface
+================
+1. pghot_promote_freq_window_ms
+
+Path: /proc/sys/vm/pghot_promote_freq_window_ms
+
+- Controls the time window (in ms) for counting access frequency. A page is
+  considered hot only when **freq_threshold** accesses occur within this
+  time period.
+- Default: 3000 (3 seconds)
+- Example:
+  # sysctl vm.pghot_promote_freq_window_ms=3000
+
+Vmstat Counters
+===============
+The following vmstat counters provide stats about the pghot subsystem.
+
+Path: /proc/vmstat
+
+1. **pghot_recorded_accesses**
+   - Total number of page accesses recorded by pghot.
+
+2. **pghot_recorded_hintfaults**
+   - Number of recorded accesses reported by NUMA Balancing based
+     hotness source.
+
+3. **pghot_recorded_hwhints**
+   - Number of recorded accesses reported by hwhints source.
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d136612eef9d..53bae80d11ae 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -107,7 +107,7 @@ static inline void softleaf_entry_wait_on_locked(softleaf_t entry, spinlock_t *p
 
 #endif /* CONFIG_MIGRATION */
 
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
@@ -126,7 +126,7 @@ static inline int promote_misplaced_memcg_folios(struct list_head *folio_list, i
 {
 	return -EAGAIN; /* can't migrate now */
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
 
 #ifdef CONFIG_MIGRATION
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..eb08431dc9fb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1155,6 +1155,7 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_KMIGRATED_ACTIVATE,	/* activates kmigrated */
 };
 
 enum zone_flags {
@@ -1609,6 +1610,10 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+#ifdef CONFIG_PGHOT
+	struct task_struct *kmigrated;
+	wait_queue_head_t kmigrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -2019,12 +2024,27 @@ struct mem_section {
 	unsigned long section_mem_map;
 
 	struct mem_section_usage *usage;
+#ifdef CONFIG_PGHOT
+	/*
+	 * Per-PFN hotness data for this section.
+	 * Array of phi_t (u8 in default mode).
+	 * LSB is used as PGHOT_SECTION_HOT_BIT flag.
+	 */
+	void *hot_map;
+#endif
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
 	 * section. (see page_ext.h about this.)
 	 */
 	struct page_ext *page_ext;
+#endif
+	/*
+	 * Padding to maintain consistent mem_section size when exactly
+	 * one of PGHOT or PAGE_EXTENSION is enabled. This ensures
+	 * optimal alignment regardless of configuration.
+	 */
+#if (defined(CONFIG_PGHOT) ^ defined(CONFIG_PAGE_EXTENSION))
 	unsigned long pad;
 #endif
 	/*
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..525d4dd28fc1
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGHOT_H
+#define _LINUX_PGHOT_H
+
+/* Page hotness temperature sources */
+enum pghot_src {
+	PGHOT_HINTFAULTS = 0,
+	PGHOT_HWHINTS,
+	PGHOT_SRC_MAX
+};
+
+#ifdef CONFIG_PGHOT
+#include
+
+extern unsigned int pghot_target_nid;
+extern unsigned int pghot_src_enabled;
+extern unsigned int pghot_freq_threshold;
+extern unsigned int kmigrated_sleep_ms;
+extern unsigned int kmigrated_batch_nr;
+extern unsigned int sysctl_pghot_freq_window;
+
+void pghot_debug_init(void);
+
+DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
+
+#define PGHOT_HINTFAULTS_ENABLED	BIT(PGHOT_HINTFAULTS)
+#define PGHOT_HWHINTS_ENABLED	BIT(PGHOT_HWHINTS)
+#define PGHOT_SRC_ENABLED_MASK	GENMASK(PGHOT_SRC_MAX - 1, 0)
+
+#define PGHOT_DEFAULT_FREQ_THRESHOLD	2
+
+#define KMIGRATED_DEFAULT_SLEEP_MS	100
+#define KMIGRATED_DEFAULT_BATCH_NR	512
+
+#define PGHOT_DEFAULT_NODE	0
+
+#define PGHOT_DEFAULT_FREQ_WINDOW	(3 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-6 are used to store frequency and time.
+ * Bit 7 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY	7
+
+#define PGHOT_FREQ_WIDTH	2
+/* Bucketed time is stored in 5 bits which can represent up to 3.9s with HZ=1000 */
+#define PGHOT_TIME_BUCKETS_SHIFT	7
+#define PGHOT_TIME_WIDTH	5
+#define PGHOT_NID_WIDTH	10
+
+#define PGHOT_FREQ_SHIFT	0
+#define PGHOT_TIME_SHIFT	(PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_FREQ_MASK	GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK	GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+#define PGHOT_TIME_BUCKETS_MASK	(PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_SHIFT)
+
+#define PGHOT_NID_MAX	((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX	((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX	((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u8 phi_t;
+
+#define PGHOT_RECORD_SIZE	sizeof(phi_t)
+
+#define PGHOT_SECTION_HOT_BIT	0
+#define PGHOT_SECTION_HOT_MASK	BIT(PGHOT_SECTION_HOT_BIT)
+
+bool pghot_nid_valid(int nid);
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time);
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now);
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time);
+
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+	return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_PGHOT_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..58d510711bd4 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -175,6 +175,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSTACK_REST,
 #endif
 #endif /* CONFIG_DEBUG_STACK_USAGE */
+#ifdef CONFIG_PGHOT
+		PGHOT_RECORDED_ACCESSES,
+		PGHOT_RECORDED_HINTFAULTS,
+		PGHOT_RECORDED_HWHINTS,
+#endif /* CONFIG_PGHOT */
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 0a43bb80df4f..ebfa149d8123 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1469,6 +1469,20 @@ config LAZY_MMU_MODE_KUNIT_TEST
 
 	  If unsure, say N.
 
+config PGHOT
+	bool "Hot page tracking and promotion"
+	default n
+	depends on NUMA_MIGRATION && SPARSEMEM
+	help
+	  A sub-system to track page accesses in lower tier memory and
+	  maintain hot page information. Promotes hot pages from lower
+	  tiers to top tier by using the memory access information provided
+	  by various sources. Asynchronous promotion is done by per-node
+	  kernel threads.
+
+	  This adds 1 byte of metadata overhead per page in lower-tier
+	  memory nodes.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..33014de43acc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -150,3 +150,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) += tests/lazy_mmu_mode_kunit.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
diff --git a/mm/migrate.c b/mm/migrate.c
index 747277aadf19..726d27b61a46 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2625,7 +2625,7 @@ SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
 }
 #endif /* CONFIG_NUMA_MIGRATION */
 
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_PGHOT)
 /*
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which is crude.
@@ -2745,12 +2745,10 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
  */
 int migrate_misplaced_folio(struct folio *folio, int node)
 {
-	pg_data_t *pgdat = NODE_DATA(node);
 	int nr_remaining;
 	unsigned int nr_succeeded;
 	LIST_HEAD(migratepages);
 	struct mem_cgroup *memcg = get_mem_cgroup_from_folio(folio);
-	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	list_add(&folio->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
@@ -2759,12 +2757,18 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	if (nr_remaining && !list_empty(&migratepages))
 		putback_movable_pages(&migratepages);
 	if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
 		if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 		    && !node_is_toptier(folio_nid(folio))
-		    && node_is_toptier(node))
+		    && node_is_toptier(node)) {
+			pg_data_t *pgdat = NODE_DATA(node);
+			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
 			mod_lruvec_state(lruvec, PGPROMOTE_SUCCESS, nr_succeeded);
+		}
+#endif
 	}
 	mem_cgroup_put(memcg);
 	BUG_ON(!list_empty(&migratepages));
@@ -2817,14 +2821,16 @@ int promote_misplaced_memcg_folios(struct list_head *folio_list, int node)
 		putback_movable_pages(folio_list);
 
 	if (nr_succeeded) {
+#ifdef CONFIG_NUMA_BALANCING
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
 		count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
 		mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)),
 				 PGPROMOTE_SUCCESS, nr_succeeded);
+#endif
 	}
 
 	mem_cgroup_put(memcg);
 	WARN_ON(!list_empty(folio_list));
 	return nr_remaining ? -EAGAIN : 0;
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_NUMA_BALANCING || CONFIG_PGHOT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..2396c42028ae 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1384,6 +1384,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 #endif
 
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kmigrated(struct pglist_data *pgdat)
+{
+	init_waitqueue_head(&pgdat->kmigrated_wait);
+}
+#else
+static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {}
+#endif
+
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
 	int i;
@@ -1393,6 +1402,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
+	pgdat_init_kmigrated(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/pghot-default.c b/mm/pghot-default.c
new file mode 100644
index 000000000000..e610062345e4
--- /dev/null
+++ b/mm/pghot-default.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Default mode
+ *
+ * 1 byte hotness record per PFN.
+ * Bucketed time and frequency tracked as part of the record.
+ * Promotion to @pghot_target_nid by default.
+ */
+
+#include
+#include
+
+/* pghot-default doesn't store and hence no NID validation is required */
+bool pghot_nid_valid(int nid)
+{
+	return true;
+}
+
+/*
+ * @time is regular time, @old_time is bucketed time.
+ */ +unsigned long pghot_access_latency(unsigned long old_time, unsigned long t= ime) +{ + time &=3D PGHOT_TIME_BUCKETS_MASK; + old_time <<=3D PGHOT_TIME_BUCKETS_SHIFT; + + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_BUCKETS_MASK); +} + +bool pghot_update_record(phi_t *phi, int nid, unsigned long now) +{ + phi_t freq, old_freq, hotness, old_hotness, old_time; + phi_t time =3D now >> PGHOT_TIME_BUCKETS_SHIFT; + + old_hotness =3D READ_ONCE(*phi); + do { + bool new_window =3D false; + + hotness =3D old_hotness; + old_freq =3D (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + old_time =3D (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + + if (pghot_access_latency(old_time, now) > sysctl_pghot_freq_window) + new_window =3D true; + + if (new_window) + freq =3D 1; + else if (old_freq < PGHOT_FREQ_MAX) + freq =3D old_freq + 1; + else + freq =3D old_freq; + + hotness &=3D ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT); + hotness &=3D ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT); + + hotness |=3D (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT; + hotness |=3D (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT; + + if (freq >=3D pghot_freq_threshold) + hotness |=3D BIT(PGHOT_MIGRATE_READY); + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + return !!(hotness & BIT(PGHOT_MIGRATE_READY)); +} + +int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time) +{ + phi_t old_hotness, hotness =3D 0; + + old_hotness =3D READ_ONCE(*phi); + do { + if (!(old_hotness & BIT(PGHOT_MIGRATE_READY))) + return -EINVAL; + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + + *nid =3D pghot_target_nid; + *freq =3D (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + *time =3D (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + return 0; +} diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c new file mode 100644 index 000000000000..f04e2137309e --- /dev/null +++ b/mm/pghot-tunables.c @@ -0,0 +1,182 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * pghot 
tunables in debugfs + */ +#include +#include +#include + +static struct dentry *debugfs_pghot; +static DEFINE_MUTEX(pghot_tunables_lock); + +static ssize_t pghot_freq_th_write(struct file *filp, const char __user *u= buf, + size_t cnt, loff_t *ppos) +{ + char buf[16]; + unsigned int freq; + + if (cnt > 15) + cnt =3D 15; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + buf[cnt] =3D '\0'; + + if (kstrtouint(buf, 10, &freq)) + return -EINVAL; + + if (!freq || freq > PGHOT_FREQ_MAX) + return -EINVAL; + + mutex_lock(&pghot_tunables_lock); + pghot_freq_threshold =3D freq; + mutex_unlock(&pghot_tunables_lock); + + *ppos +=3D cnt; + return cnt; +} + +static int pghot_freq_th_show(struct seq_file *m, void *v) +{ + seq_printf(m, "%d\n", pghot_freq_threshold); + return 0; +} + +static int pghot_freq_th_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, pghot_freq_th_show, NULL); +} + +static const struct file_operations pghot_freq_th_fops =3D { + .open =3D pghot_freq_th_open, + .write =3D pghot_freq_th_write, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D seq_release, +}; + +static ssize_t pghot_target_nid_write(struct file *filp, const char __user= *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[16]; + unsigned int nid; + + if (cnt > 15) + cnt =3D 15; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + buf[cnt] =3D '\0'; + + if (kstrtouint(buf, 10, &nid)) + return -EINVAL; + + if (nid > PGHOT_NID_MAX || !node_online(nid) || !node_is_toptier(nid)) + return -EINVAL; + mutex_lock(&pghot_tunables_lock); + pghot_target_nid =3D nid; + mutex_unlock(&pghot_tunables_lock); + + *ppos +=3D cnt; + return cnt; +} + +static int pghot_target_nid_show(struct seq_file *m, void *v) +{ + seq_printf(m, "%d\n", pghot_target_nid); + return 0; +} + +static int pghot_target_nid_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, pghot_target_nid_show, NULL); +} + +static const struct file_operations 
pghot_target_nid_fops =3D { + .open =3D pghot_target_nid_open, + .write =3D pghot_target_nid_write, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D seq_release, +}; + +static void pghot_src_enabled_update(unsigned int enabled) +{ + unsigned int changed =3D pghot_src_enabled ^ enabled; + + if (changed & PGHOT_HINTFAULTS_ENABLED) { + if (enabled & PGHOT_HINTFAULTS_ENABLED) + static_branch_enable(&pghot_src_hintfaults); + else + static_branch_disable(&pghot_src_hintfaults); + } + + if (changed & PGHOT_HWHINTS_ENABLED) { + if (enabled & PGHOT_HWHINTS_ENABLED) + static_branch_enable(&pghot_src_hwhints); + else + static_branch_disable(&pghot_src_hwhints); + } +} + +static ssize_t pghot_src_enabled_write(struct file *filp, const char __use= r *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[16]; + unsigned int enabled; + + if (cnt > 15) + cnt =3D 15; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + buf[cnt] =3D '\0'; + + if (kstrtouint(buf, 0, &enabled)) + return -EINVAL; + + if (enabled & ~PGHOT_SRC_ENABLED_MASK) + return -EINVAL; + + mutex_lock(&pghot_tunables_lock); + pghot_src_enabled_update(enabled); + pghot_src_enabled =3D enabled; + mutex_unlock(&pghot_tunables_lock); + + *ppos +=3D cnt; + return cnt; +} + +static int pghot_src_enabled_show(struct seq_file *m, void *v) +{ + seq_printf(m, "%u\n", pghot_src_enabled); + return 0; +} + +static int pghot_src_enabled_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, pghot_src_enabled_show, NULL); +} + +static const struct file_operations pghot_src_enabled_fops =3D { + .open =3D pghot_src_enabled_open, + .write =3D pghot_src_enabled_write, + .read =3D seq_read, + .llseek =3D seq_lseek, + .release =3D seq_release, +}; + +void pghot_debug_init(void) +{ + debugfs_pghot =3D debugfs_create_dir("pghot", NULL); + debugfs_create_file("enabled_sources", 0644, debugfs_pghot, NULL, + &pghot_src_enabled_fops); + debugfs_create_file("target_nid", 0644, debugfs_pghot, NULL, + 
&pghot_target_nid_fops); + debugfs_create_file("freq_threshold", 0644, debugfs_pghot, NULL, + &pghot_freq_th_fops); + debugfs_create_u32("kmigrated_sleep_ms", 0644, debugfs_pghot, + &kmigrated_sleep_ms); + debugfs_create_u32("kmigrated_batch_nr", 0644, debugfs_pghot, + &kmigrated_batch_nr); +} diff --git a/mm/pghot.c b/mm/pghot.c new file mode 100644 index 000000000000..02e6959b647a --- /dev/null +++ b/mm/pghot.c @@ -0,0 +1,494 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Maintains information about hot pages from slower tier nodes and + * promotes them. + * + * Per-PFN hotness information is stored for lower tier nodes in + * mem_section. + * + * In the default mode, a single byte (u8) is used to store + * the frequency of access and last access time. Promotions are done + * to a default toptier NID. + * + * A kernel thread named kmigrated is provided to migrate or promote + * the hot pages. kmigrated runs for each lower tier node. It iterates + * over the node's PFNs and migrates pages marked for migration into + * their targeted nodes. 
+ */ +#include +#include +#include +#include +#include + +unsigned int pghot_target_nid =3D PGHOT_DEFAULT_NODE; +unsigned int pghot_src_enabled; +unsigned int pghot_freq_threshold =3D PGHOT_DEFAULT_FREQ_THRESHOLD; +unsigned int kmigrated_sleep_ms =3D KMIGRATED_DEFAULT_SLEEP_MS; +unsigned int kmigrated_batch_nr =3D KMIGRATED_DEFAULT_BATCH_NR; + +unsigned int sysctl_pghot_freq_window =3D PGHOT_DEFAULT_FREQ_WINDOW; + +DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints); +DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults); + +#ifdef CONFIG_SYSCTL +static const struct ctl_table pghot_sysctls[] =3D { + { + .procname =3D "pghot_promote_freq_window_ms", + .data =3D &sysctl_pghot_freq_window, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, +}; +#endif + +static bool kmigrated_started __ro_after_init; + +/** + * pghot_record_access() - Record page accesses from lower tier memory + * for the purpose of tracking page hotness and subsequent promotion. + * + * @pfn: PFN of the page + * @nid: Unused + * @src: The identifier of the sub-system that reports the access + * @now: Access time in jiffies + * + * Updates the frequency and time of access and marks the page as + * ready for migration if the frequency crosses a threshold. The pages + * marked for migration are migrated by kmigrated kernel thread. + * + * Return: 0 on success and -EINVAL on failure to record the access. 
+ */ +int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long= now) +{ + struct mem_section *ms; + struct folio *folio; + phi_t *phi, *hot_map; + struct page *page; + int src_nid; + + if (!kmigrated_started) + return 0; + + if (!pghot_nid_valid(nid)) + return -EINVAL; + + switch (src) { + case PGHOT_HINTFAULTS: + if (!static_branch_unlikely(&pghot_src_hintfaults)) + return 0; + count_vm_event(PGHOT_RECORDED_HINTFAULTS); + break; + case PGHOT_HWHINTS: + if (!static_branch_unlikely(&pghot_src_hwhints)) + return 0; + count_vm_event(PGHOT_RECORDED_HWHINTS); + break; + default: + return -EINVAL; + } + + src_nid =3D pfn_to_nid(pfn); + if (src_nid =3D=3D nid) + return 0; + + /* + * Record only accesses from lower tiers. + */ + if (node_is_toptier(src_nid)) + return 0; + + /* + * Reject the non-migratable pages right away. + */ + page =3D pfn_to_online_page(pfn); + if (!page || is_zone_device_page(page)) + return 0; + + folio =3D page_folio(page); + if (!folio_try_get(folio)) + return 0; + + if (unlikely(page_folio(page) !=3D folio)) + goto out; + + if (!folio_test_lru(folio)) + goto out; + + /* Get the hotness slot corresponding to the 1st PFN of the folio */ + pfn =3D folio_pfn(folio); + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + goto out; + + hot_map =3D (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT= _MASK); + phi =3D &hot_map[pfn % PAGES_PER_SECTION]; + + count_vm_event(PGHOT_RECORDED_ACCESSES); + + /* + * Update the hotness parameters. 
+ */ + if (pghot_update_record(phi, nid, now)) { + set_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map); + set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags); + } +out: + folio_put(folio); + return 0; +} + +static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq, + unsigned long *time) +{ + phi_t *phi, *hot_map; + struct mem_section *ms; + + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + return -EINVAL; + + hot_map =3D (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT= _MASK); + phi =3D &hot_map[pfn % PAGES_PER_SECTION]; + + return pghot_get_record(phi, nid, freq, time); +} + +/* + * Walks the PFNs of the zone, isolates and migrates them in batches. + */ +static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end= _pfn, + int src_nid) +{ + struct mem_cgroup *cur_memcg =3D NULL; + int cur_nid =3D NUMA_NO_NODE; + LIST_HEAD(migrate_list); + int batch_count =3D 0; + struct folio *folio; + struct page *page; + unsigned long pfn; + + pfn =3D start_pfn; + do { + int nid =3D NUMA_NO_NODE, nr =3D 1; + struct mem_cgroup *memcg; + unsigned long time =3D 0; + int freq =3D 0; + + if (!pfn_valid(pfn)) + goto out_next; + + page =3D pfn_to_online_page(pfn); + if (!page) + goto out_next; + + folio =3D page_folio(page); + if (!folio_try_get(folio)) + goto out_next; + + if (unlikely(page_folio(page) !=3D folio)) { + folio_put(folio); + goto out_next; + } + + nr =3D folio_nr_pages(folio); + if (folio_nid(folio) !=3D src_nid) { + folio_put(folio); + goto out_next; + } + + if (!folio_test_lru(folio)) { + folio_put(folio); + goto out_next; + } + + if (pghot_get_hotness(pfn, &nid, &freq, &time)) { + folio_put(folio); + goto out_next; + } + + if (nid =3D=3D NUMA_NO_NODE) + nid =3D pghot_target_nid; + + if (folio_nid(folio) =3D=3D nid) { + folio_put(folio); + goto out_next; + } + + if (migrate_misplaced_folio_prepare(folio, NULL, nid)) { + folio_put(folio); + goto out_next; + } + + memcg =3D folio_memcg(folio); + if 
(cur_nid =3D=3D NUMA_NO_NODE) { + cur_nid =3D nid; + cur_memcg =3D memcg; + } + + /* If NID or memcg changed, flush the previous batch first */ + if (cur_nid !=3D nid || cur_memcg !=3D memcg) { + if (!list_empty(&migrate_list)) + promote_misplaced_memcg_folios(&migrate_list, cur_nid); + cur_nid =3D nid; + cur_memcg =3D memcg; + batch_count =3D 0; + cond_resched(); + } + + list_add(&folio->lru, &migrate_list); + folio_put(folio); + + if (++batch_count > kmigrated_batch_nr) { + promote_misplaced_memcg_folios(&migrate_list, cur_nid); + batch_count =3D 0; + cond_resched(); + } +out_next: + pfn +=3D nr; + } while (pfn < end_pfn); + if (!list_empty(&migrate_list)) + promote_misplaced_memcg_folios(&migrate_list, cur_nid); +} + +static void kmigrated_do_work(pg_data_t *pgdat) +{ + unsigned long section_nr, s_begin, start_pfn; + struct mem_section *ms; + int nid; + + clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags); + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + start_pfn =3D section_nr_to_pfn(section_nr); + ms =3D __nr_to_section(section_nr); + + if (!pfn_valid(start_pfn)) + continue; + + nid =3D pfn_to_nid(start_pfn); + if (node_is_toptier(nid) || nid !=3D pgdat->node_id) + continue; + + if (!test_and_clear_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot= _map)) + continue; + + kmigrated_walk_zone(start_pfn, start_pfn + PAGES_PER_SECTION, + pgdat->node_id); + } +} + +static inline bool kmigrated_work_requested(pg_data_t *pgdat) +{ + return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags); +} + +/* + * Per-node kthread that iterates over its PFNs and migrates the + * pages that have been marked for migration. 
+ */ +static int kmigrated(void *p) +{ + pg_data_t *pgdat =3D p; + + while (!kthread_should_stop()) { + long timeout =3D msecs_to_jiffies(READ_ONCE(kmigrated_sleep_ms)); + + if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(p= gdat), + timeout)) + kmigrated_do_work(pgdat); + } + return 0; +} + +static int kmigrated_run(int nid) +{ + pg_data_t *pgdat =3D NODE_DATA(nid); + int ret; + + if (!pgdat->kmigrated) { + pgdat->kmigrated =3D kthread_create_on_node(kmigrated, pgdat, nid, + "kmigrated%d", nid); + if (IS_ERR(pgdat->kmigrated)) { + ret =3D PTR_ERR(pgdat->kmigrated); + pgdat->kmigrated =3D NULL; + pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret); + return ret; + } + pr_info("pghot: Started kmigrated thread for node %d\n", nid); + } + wake_up_process(pgdat->kmigrated); + return 0; +} + +static void pghot_free_hot_map(struct mem_section *ms) +{ + kfree((void *)((unsigned long)ms->hot_map & ~PGHOT_SECTION_HOT_MASK)); + ms->hot_map =3D NULL; +} + +static int pghot_alloc_hot_map(struct mem_section *ms, int nid) +{ + ms->hot_map =3D kcalloc_node(PAGES_PER_SECTION, PGHOT_RECORD_SIZE, GFP_KE= RNEL, + nid); + if (!ms->hot_map) + return -ENOMEM; + return 0; +} + +static void pghot_offline_sec_hotmap(unsigned long start_pfn, + unsigned long nr_pages) +{ + unsigned long start, end, pfn; + struct mem_section *ms; + + start =3D SECTION_ALIGN_DOWN(start_pfn); + end =3D SECTION_ALIGN_UP(start_pfn + nr_pages); + + for (pfn =3D start; pfn < end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (!ms || !ms->hot_map) + continue; + + pghot_free_hot_map(ms); + } +} + +static int pghot_online_sec_hotmap(unsigned long start_pfn, + unsigned long nr_pages) +{ + int nid =3D pfn_to_nid(start_pfn); + unsigned long start, end, pfn; + struct mem_section *ms; + int fail =3D 0; + + start =3D SECTION_ALIGN_DOWN(start_pfn); + end =3D SECTION_ALIGN_UP(start_pfn + nr_pages); + + for (pfn =3D start; !fail && pfn < end; pfn +=3D PAGES_PER_SECTION) { + ms 
=3D __pfn_to_section(pfn); + if (!ms || ms->hot_map) + continue; + + fail =3D pghot_alloc_hot_map(ms, nid); + } + + if (!fail) + return 0; + + /* rollback */ + end =3D pfn - PAGES_PER_SECTION; + for (pfn =3D start; pfn < end; pfn +=3D PAGES_PER_SECTION) { + ms =3D __pfn_to_section(pfn); + if (ms && ms->hot_map) + pghot_free_hot_map(ms); + } + return -ENOMEM; +} + +static int pghot_memhp_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + struct memory_notify *mn =3D arg; + int ret =3D 0; + + switch (action) { + case MEM_GOING_ONLINE: + ret =3D pghot_online_sec_hotmap(mn->start_pfn, mn->nr_pages); + break; + case MEM_OFFLINE: + case MEM_CANCEL_ONLINE: + pghot_offline_sec_hotmap(mn->start_pfn, mn->nr_pages); + break; + } + + return notifier_from_errno(ret); +} + +static struct notifier_block pghot_mem_notifier =3D { + .notifier_call =3D pghot_memhp_callback, + .priority =3D DEFAULT_CALLBACK_PRI, +}; + +static void pghot_destroy_hot_map(void) +{ + unsigned long section_nr, s_begin; + struct mem_section *ms; + + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + ms =3D __nr_to_section(section_nr); + pghot_free_hot_map(ms); + } + + unregister_memory_notifier(&pghot_mem_notifier); +} + +static int pghot_setup_hot_map(void) +{ + unsigned long section_nr, s_begin, start_pfn; + struct mem_section *ms; + int nid, ret; + + ret =3D register_memory_notifier(&pghot_mem_notifier); + if (ret) + return ret; + + s_begin =3D next_present_section_nr(-1); + for_each_present_section_nr(s_begin, section_nr) { + ms =3D __nr_to_section(section_nr); + start_pfn =3D section_nr_to_pfn(section_nr); + nid =3D pfn_to_nid(start_pfn); + + if (node_is_toptier(nid) || !pfn_valid(start_pfn)) + continue; + + if (pghot_alloc_hot_map(ms, nid)) + goto out_free_hot_map; + } + return 0; + +out_free_hot_map: + pghot_destroy_hot_map(); + return -ENOMEM; +} + +static int __init pghot_init(void) +{ + pg_data_t *pgdat; + int nid, ret; + + 
ret =3D pghot_setup_hot_map(); + if (ret) + return ret; + + for_each_node_state(nid, N_MEMORY) { + if (node_is_toptier(nid)) + continue; + + ret =3D kmigrated_run(nid); + if (ret) + goto out_stop_kthread; + } + register_sysctl_init("vm", pghot_sysctls); + pghot_debug_init(); + + kmigrated_started =3D true; + return 0; + +out_stop_kthread: + for_each_node_state(nid, N_MEMORY) { + pgdat =3D NODE_DATA(nid); + if (pgdat->kmigrated) { + kthread_stop(pgdat->kmigrated); + pgdat->kmigrated =3D NULL; + } + } + pghot_destroy_hot_map(); + return ret; +} + +late_initcall_sync(pghot_init) diff --git a/mm/vmstat.c b/mm/vmstat.c index f534972f517d..4064ead568cc 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1489,6 +1489,11 @@ const char * const vmstat_text[] =3D { [I(KSTACK_REST)] =3D "kstack_rest", #endif #endif +#ifdef CONFIG_PGHOT + [I(PGHOT_RECORDED_ACCESSES)] =3D "pghot_recorded_accesses", + [I(PGHOT_RECORDED_HINTFAULTS)] =3D "pghot_recorded_hintfaults", + [I(PGHOT_RECORDED_HWHINTS)] =3D "pghot_recorded_hwhints", +#endif /* CONFIG_PGHOT */ #undef I #endif /* CONFIG_VM_EVENT_COUNTERS */ }; --=20 2.34.1 From nobody Fri Jun 12 12:43:35 2026 Received: from BYAPR05CU005.outbound.protection.outlook.com (mail-westusazon11010065.outbound.protection.outlook.com [52.101.85.65]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B71A5246778 for ; Mon, 4 May 2026 06:10:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=52.101.85.65 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777875044; cv=fail; b=AiZDM+oZXndTRR6fi/McfEFtso9cDzj+lFRDv91bgVCBTnDybY8AtqjAdD7p9xY6alSuC2Vt/nOZ4pt8Cq+jCWI0cda8IRzbTx04AziLyBeJWJ//h0KnVjuRatxQy15F7LCaYuG9zu0SrW3CRe4sOxn/169bvy4jKziox83Y7mY= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777875044; c=relaxed/simple; 
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=OKkOxvZcc/lEXyHwHuHe7DdBQkLVKZm5h+kQktzq8ms=; b=F2oprmRNCphScDDwcVjyk+lKb8eqGtQmdkJGi/HdVhSsHqb1RrmLswCUTHpDeyBMOo2MSutyzuuycbr5eSL/2L640qHoSGrRwjrWiubxculCTCgNNXLdAw0ONwUOki4gpMAU0alEAA/0vVRrazh67qbdDQT+gyIbIcuwP/2zjtA= Received: from CH5P222CA0015.NAMP222.PROD.OUTLOOK.COM (2603:10b6:610:1ee::13) by LV3PR12MB9257.namprd12.prod.outlook.com (2603:10b6:408:1b7::17) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9870.25; Mon, 4 May 2026 06:10:36 +0000 Received: from CH1PEPF0000AD7B.namprd04.prod.outlook.com (2603:10b6:610:1ee:cafe::18) by CH5P222CA0015.outlook.office365.com (2603:10b6:610:1ee::13) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9870.25 via Frontend Transport; Mon, 4 May 2026 06:10:36 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C Received: from satlexmb07.amd.com (165.204.84.17) by CH1PEPF0000AD7B.mail.protection.outlook.com (10.167.244.58) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9891.9 via Frontend Transport; Mon, 4 May 2026 06:10:36 +0000 Received: from BLR-L-BHARARAO.amd.com (10.180.168.240) by satlexmb07.amd.com (10.181.42.216) with 
Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Mon, 4 May 2026 01:10:27 -0500 From: Bharata B Rao To: , CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , , Subject: [PATCH v7 4/7] mm: pghot: Precision mode for pghot Date: Mon, 4 May 2026 11:39:21 +0530 Message-ID: <20260504060924.344313-5-bharata@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260504060924.344313-1-bharata@amd.com> References: <20260504060924.344313-1-bharata@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: satlexmb08.amd.com (10.181.42.217) To satlexmb07.amd.com (10.181.42.216) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: CH1PEPF0000AD7B:EE_|LV3PR12MB9257:EE_ X-MS-Office365-Filtering-Correlation-Id: 18507e61-4a3b-44d9-aa3e-08dea9a3db8c X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|82310400026|36860700016|1800799024|376014|7416014|56012099003|22082099003|18002099003; X-Microsoft-Antispam-Message-Info: 99on9g/1DurDiEaYS3wZGy+2qYqlsz4isLyjwGBjxDzcmaMzn6xp3G5a6qjzOmSN/SSPTcj9lk8+gAjRt5uwsmrPM+ewPZOsTiMFEKbdudo3pwmq/fQVJ4jI5tifkiMj40joAPVH4w0jZKSjIoDsaNRcITq4auCV3Y6RKJOFO9zi5IlZrOZDTX1W55XVg5id12/MVRZJLVAJOKsNjM2o4GB/22ILDFcQ5rPxQpKjkQsm59OCOKNUROexHiTc/RiS8wsLdp42bVrFMeDRaEduNmTQKuxAhXzDzckmymyTookYiG28Mv2CG2iXWzZwkbYFmO5ivo0raIKWISiyR3D90DOPLmMF/c7R9sd6WNsmJpDB7XGYA+b8Z6MBVqWzyN2Hgf3Xo+5pr1jqbqPZ24xdU4hxbJ9r5DzkwUNh5GhuhMM/NjSqoIe+VVwnwTU1PWOzPII7a36CsZ0bw0KNu6GBsu2niiWC8C2PoYa0HAmUY6H5VfX9PSQ6ETDc4lBS8JBM5AWbqRcHVbehIVZrVEEHUDWNN+5cQuvkyQZzV5gVTEg86GIUNu8Ea/DWfC+E2TsPn8P2x8KqSkgWjCJfkxffEQt395FnbQYmTubyq4OqSaHfwfzmIOvmybqxNAm24KRiKAmNViCk7k4uKplUA8NLB+M4JRrhzKMr9L8KM3+aAcNYAhBMojkU15KqsINznPZpNcWcFYjJDQ/whJSJTgOb27VanN7yJos3bkWBRnFxjGs= 
X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(82310400026)(36860700016)(1800799024)(376014)(7416014)(56012099003)(22082099003)(18002099003);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: xpM7qvBc9Zz2FtaGU9qXwQgT0jjxLarCWOt70k5DZYnhMDFf63rnW9R4beKcbzxeFmDaU1Wi/IYGqpbY8/UohOg0gnF7FQlVhWiRlhkLqc3IQuqVzBKGfE16ZJmKdx0MVmRkxvQK2HCe24abzfxi0eg3s7S6c1Zk9qt/BD6OrcXwVZBmNUi6S70qAs5zOoCWz1zC7gEQKDtMQ8d1LbzA7gD7RRNQi7HodnUKXLshSSPezsH9y5l5kOR7TofcvOK7CxbrYo+xMr9qHRTuMdDoV/HC4Nj0lnnXeT4w4nE8RjMh6hufuUtxkx7LDv0oTHGOZv+MVxmzlhI77l1G4eGREtJaqaCQs9Q9JF121P2cXQH4pIQu0rp9+Gk0ElvtAeZxLc7Y4wNqmwGI40efk7xbkS0DJtskhYZJprNGBv0EvUU+q0/B9SKUFc0Nk6smepKD X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 04 May 2026 06:10:36.0809 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 18507e61-4a3b-44d9-aa3e-08dea9a3db8c X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com] X-MS-Exchange-CrossTenant-AuthSource: CH1PEPF0000AD7B.namprd04.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: LV3PR12MB9257 Default pghot stores hotness in a 1=E2=80=91byte record per PFN, limiting frequency to 2 bits, time to a 5=E2=80=91bit bucket, and preventing storage of per=E2=80=91PFN toptier NID. This restricts time granularity and forces all promotions to use the global pghot_target_nid. 
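For contrast with the 1-byte record, the 4-byte record layout that this patch introduces (10-bit NID from bit 0, 3-bit frequency, 14-bit time, migrate-ready flag in bit 31) can be modelled in userspace roughly as below. This is only a sketch: the field widths and shifts follow the patch, but the helper names (pack_record(), record_nid(), elapsed_ticks() and so on) are made up for illustration and are not kernel symbols.

```c
#include <stdint.h>

/* Userspace mirror of the precision-mode layout; widths follow the patch. */
#define NID_WIDTH		10
#define FREQ_WIDTH		3
#define TIME_WIDTH		14

#define NID_SHIFT		0
#define FREQ_SHIFT		(NID_SHIFT + NID_WIDTH)
#define TIME_SHIFT		(FREQ_SHIFT + FREQ_WIDTH)
#define MIGRATE_READY_BIT	31

#define NID_MASK		((1u << NID_WIDTH) - 1)
#define FREQ_MASK		((1u << FREQ_WIDTH) - 1)
#define TIME_MASK		((1u << TIME_WIDTH) - 1)

/* Hypothetical helper: pack the fields; oversized values are truncated. */
static uint32_t pack_record(uint32_t nid, uint32_t freq, uint32_t time, int ready)
{
	uint32_t rec = 0;

	rec |= (nid & NID_MASK) << NID_SHIFT;
	rec |= (freq & FREQ_MASK) << FREQ_SHIFT;
	rec |= (time & TIME_MASK) << TIME_SHIFT;
	if (ready)
		rec |= 1u << MIGRATE_READY_BIT;
	return rec;
}

/* Hypothetical accessors for the individual fields. */
static uint32_t record_nid(uint32_t rec)
{
	return (rec >> NID_SHIFT) & NID_MASK;
}

static uint32_t record_freq(uint32_t rec)
{
	return (rec >> FREQ_SHIFT) & FREQ_MASK;
}

static uint32_t record_time(uint32_t rec)
{
	return (rec >> TIME_SHIFT) & TIME_MASK;
}

static int record_ready(uint32_t rec)
{
	return (rec >> MIGRATE_READY_BIT) & 1u;
}

/* Ticks elapsed between two truncated timestamps; survives a single wrap. */
static uint32_t elapsed_ticks(uint32_t old_time, uint32_t new_time)
{
	return (new_time - old_time) & TIME_MASK;
}
```

Masking the difference in elapsed_ticks() is what keeps the latency correct when the 14-bit time field wraps between two accesses, as long as the real gap is shorter than one full wrap of the field.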
This patch adds an optional precision mode (CONFIG_PGHOT_PRECISE) that expands the hotness record to 4 bytes (u32) and provides: - 10=E2=80=91bit NID field for per=E2=80=91PFN promotion target, - 3=E2=80=91bit frequency field (freq_threshold range 1=E2=80=937), - 14=E2=80=91bit time field offering finer recency tracking, - MSB migrate=E2=80=91ready bit. Precision mode improves placement accuracy on systems with multiple toptier nodes and provides higher=E2=80=91resolution hotness tracking, at the cost of increasing metadata to 4 bytes per PFN. Documentation, tunables, and the record layout are updated accordingly. Signed-off-by: Bharata B Rao --- Documentation/admin-guide/mm/pghot.rst | 4 +- include/linux/mmzone.h | 2 +- include/linux/pghot.h | 31 ++++++++++ mm/Kconfig | 11 ++++ mm/Makefile | 7 ++- mm/pghot-precise.c | 81 ++++++++++++++++++++++++++ mm/pghot.c | 13 +++-- 7 files changed, 141 insertions(+), 8 deletions(-) create mode 100644 mm/pghot-precise.c diff --git a/Documentation/admin-guide/mm/pghot.rst b/Documentation/admin-g= uide/mm/pghot.rst index 5f51dd1d4d45..7b84e911afe7 100644 --- a/Documentation/admin-guide/mm/pghot.rst +++ b/Documentation/admin-guide/mm/pghot.rst @@ -37,7 +37,7 @@ Path: /sys/kernel/debug/pghot/ =20 3. **freq_threshold** - Minimum access frequency before a page is marked ready for promotion. - - Range: 1 to 3 + - Range: 1 to 3 in default mode, 1 to 7 in precision mode. - Default: 2 - Example: # echo 3 > /sys/kernel/debug/pghot/freq_threshold @@ -59,7 +59,7 @@ Path: /proc/sys/vm/pghot_promote_freq_window_ms - Controls the time window (in ms) for counting access frequency. A page is considered hot only when **freq_threshold** number of accesses occur with this time period. -- Default: 3000 (3 seconds) +- Default: 3000 (3 seconds) in default mode and 5000 (5s) in precision mod= e. 
- Example: # sysctl vm.pghot_promote_freq_window_ms=3D3000 =20 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index eb08431dc9fb..9577bdc575d9 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2027,7 +2027,7 @@ struct mem_section { #ifdef CONFIG_PGHOT /* * Per-PFN hotness data for this section. - * Array of phi_t (u8 in default mode). + * Array of phi_t (u8 in default mode, u32 in precision mode). * LSB is used as PGHOT_SECTION_HOT_BIT flag. */ void *hot_map; diff --git a/include/linux/pghot.h b/include/linux/pghot.h index 525d4dd28fc1..2e1742b8caee 100644 --- a/include/linux/pghot.h +++ b/include/linux/pghot.h @@ -35,6 +35,36 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints); =20 #define PGHOT_DEFAULT_NODE 0 =20 +#if defined(CONFIG_PGHOT_PRECISE) +#define PGHOT_DEFAULT_FREQ_WINDOW (5 * MSEC_PER_SEC) + +/* + * Bits 0-26 are used to store nid, frequency and time. + * Bits 27-30 are unused now. + * Bit 31 is used to indicate the page is ready for migration. + */ +#define PGHOT_MIGRATE_READY 31 + +#define PGHOT_NID_WIDTH 10 +#define PGHOT_FREQ_WIDTH 3 +/* time is stored in 14 bits which can represent up to 16s with HZ=3D1000 = */ +#define PGHOT_TIME_WIDTH 14 + +#define PGHOT_NID_SHIFT 0 +#define PGHOT_FREQ_SHIFT (PGHOT_NID_SHIFT + PGHOT_NID_WIDTH) +#define PGHOT_TIME_SHIFT (PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH) + +#define PGHOT_NID_MASK GENMASK(PGHOT_NID_WIDTH - 1, 0) +#define PGHOT_FREQ_MASK GENMASK(PGHOT_FREQ_WIDTH - 1, 0) +#define PGHOT_TIME_MASK GENMASK(PGHOT_TIME_WIDTH - 1, 0) + +#define PGHOT_NID_MAX ((1 << PGHOT_NID_WIDTH) - 1) +#define PGHOT_FREQ_MAX ((1 << PGHOT_FREQ_WIDTH) - 1) +#define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1) + +typedef u32 phi_t; + +#else /* !CONFIG_PGHOT_PRECISE */ #define PGHOT_DEFAULT_FREQ_WINDOW (3 * MSEC_PER_SEC) =20 /* @@ -61,6 +91,7 @@ DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints); #define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1) =20 typedef u8 phi_t; +#endif /* CONFIG_PGHOT_PRECISE */ =20 #define 
PGHOT_RECORD_SIZE sizeof(phi_t) =20 diff --git a/mm/Kconfig b/mm/Kconfig index ebfa149d8123..cc4b5685ecd4 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1483,6 +1483,17 @@ config PGHOT This adds 1 byte of metadata overhead per page in lower-tier memory nodes. =20 +config PGHOT_PRECISE + bool "Hot page tracking precision mode" + default n + depends on PGHOT + help + Enables precision mode for tracking hot pages with pghot sub-system. + Adds fine-grained access time tracking and explicit toptier target + NID tracking. Precise hot page tracking comes at the cost of using + 4 bytes per page against the default one byte per page. Preferable + to enable this on systems with multiple nodes in toptier. + source "mm/damon/Kconfig" =20 endmenu diff --git a/mm/Makefile b/mm/Makefile index 33014de43acc..dc61f4d955f8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -150,4 +150,9 @@ obj-$(CONFIG_SHRINKER_DEBUG) +=3D shrinker_debug.o obj-$(CONFIG_EXECMEM) +=3D execmem.o obj-$(CONFIG_TMPFS_QUOTA) +=3D shmem_quota.o obj-$(CONFIG_LAZY_MMU_MODE_KUNIT_TEST) +=3D tests/lazy_mmu_mode_kunit.o -obj-$(CONFIG_PGHOT) +=3D pghot.o pghot-tunables.o pghot-default.o +obj-$(CONFIG_PGHOT) +=3D pghot.o pghot-tunables.o +ifdef CONFIG_PGHOT_PRECISE +obj-$(CONFIG_PGHOT) +=3D pghot-precise.o +else +obj-$(CONFIG_PGHOT) +=3D pghot-default.o +endif diff --git a/mm/pghot-precise.c b/mm/pghot-precise.c new file mode 100644 index 000000000000..8e571988b4ce --- /dev/null +++ b/mm/pghot-precise.c @@ -0,0 +1,81 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * pghot: Precision mode + * + * 4 byte hotness record per PFN (u32) + * NID, time and frequency tracked as part of the record. 
+ */ + +#include +#include +#include + +bool pghot_nid_valid(int nid) +{ + if (nid !=3D NUMA_NO_NODE && + (!numa_valid_node(nid) || nid > PGHOT_NID_MAX || + !node_online(nid) || !node_is_toptier(nid))) + return false; + + return true; +} + +unsigned long pghot_access_latency(unsigned long old_time, unsigned long t= ime) +{ + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK); +} + +bool pghot_update_record(phi_t *phi, int nid, unsigned long now) +{ + phi_t freq, old_freq, hotness, old_hotness, old_time; + phi_t time =3D now & PGHOT_TIME_MASK; + + nid =3D (nid =3D=3D NUMA_NO_NODE) ? pghot_target_nid : nid; + old_hotness =3D READ_ONCE(*phi); + + do { + bool new_window =3D false; + + hotness =3D old_hotness; + old_freq =3D (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + old_time =3D (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + + if (pghot_access_latency(old_time, time) > sysctl_pghot_freq_window) + new_window =3D true; + + if (new_window) + freq =3D 1; + else if (old_freq < PGHOT_FREQ_MAX) + freq =3D old_freq + 1; + else + freq =3D old_freq; + + hotness &=3D ~(PGHOT_NID_MASK << PGHOT_NID_SHIFT); + hotness &=3D ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT); + hotness &=3D ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT); + + hotness |=3D (nid & PGHOT_NID_MASK) << PGHOT_NID_SHIFT; + hotness |=3D (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT; + hotness |=3D (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT; + + if (freq >=3D pghot_freq_threshold) + hotness |=3D BIT(PGHOT_MIGRATE_READY); + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + return !!(hotness & BIT(PGHOT_MIGRATE_READY)); +} + +int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time) +{ + phi_t old_hotness, hotness =3D 0; + + old_hotness =3D READ_ONCE(*phi); + do { + if (!(old_hotness & BIT(PGHOT_MIGRATE_READY))) + return -EINVAL; + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + + *nid =3D (old_hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK; + *freq =3D (old_hotness 
>> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + *time =3D (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + return 0; +} diff --git a/mm/pghot.c b/mm/pghot.c index 02e6959b647a..0b31d5917833 100644 --- a/mm/pghot.c +++ b/mm/pghot.c @@ -10,6 +10,9 @@ * the frequency of access and last access time. Promotions are done * to a default toptier NID. * + * In the precision mode, 4 bytes are used to store the frequency + * of access, last access time and the accessing NID. + * * A kernel thread named kmigrated is provided to migrate or promote * the hot pages. kmigrated runs for each lower tier node. It iterates * over the node's PFNs and migrates pages marked for migration into @@ -52,13 +55,15 @@ static bool kmigrated_started __ro_after_init; * for the purpose of tracking page hotness and subsequent promotion. * * @pfn: PFN of the page - * @nid: Unused + * @nid: Target NID to where the page needs to be migrated in precision + * mode but unused in default mode * @src: The identifier of the sub-system that reports the access * @now: Access time in jiffies * - * Updates the frequency and time of access and marks the page as - * ready for migration if the frequency crosses a threshold. The pages - * marked for migration are migrated by kmigrated kernel thread. + * Updates the NID (in precision mode only), frequency and time of access + * and marks the page as ready for migration if the frequency crosses a + * threshold. The pages marked for migration are migrated by kmigrated + * kernel thread. * * Return: 0 on success and -EINVAL on failure to record the access. 
*/ --=20 2.34.1 From nobody Fri Jun 12 12:43:35 2026 Received: from CO1PR03CU002.outbound.protection.outlook.com (mail-westus2azon11010040.outbound.protection.outlook.com [52.101.46.40]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3D9662236E0 for ; Mon, 4 May 2026 06:10:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=52.101.46.40 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777875053; cv=fail; b=syebOJM6Ti+A/CCZwz5yz+Gv4oLP9TaU6A5xlVQj6FZiJNyhKsiHD+g3RpWd0EvUpMOYN/sfBwiZxYvynrjpR7cCOagh7gNEyvX1F7tfAm/eAfCcX5fPGk6dU7GtmGz6BhmhQcidz1VM1obeOubYYRh9TIs0p7d2iG7W3E/dyB4= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777875053; c=relaxed/simple; bh=DjhhnhjrQ7wVrPg68owuao0+3KnoDgmIjqe6Khkb2+c=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=f9J9lzr9Zj7N6PGsp2s2VcUsrMcMDydJXWwdNsiv2Zk1ak+lKscWx+7mjMdHhk4LLagJ4fc+IGmYiqP0tZUWEefdcwPrfm+qUvEwTQ808tKm9yX0R3TcUNMsTcxPErwg21b+Vw4ufreLGl0yIqQ0izvyKsHESmHwcUSJgglx3J0= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=VaFjDicP; arc=fail smtp.client-ip=52.101.46.40 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="VaFjDicP" ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; 
CH1PEPF0000AD7A.namprd04.prod.outlook.com (2603:10b6:610:4c:cafe::c3) by CH2PR10CA0024.outlook.office365.com (2603:10b6:610:4c::34) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9870.22 via Frontend Transport; Mon, 4 May 2026 06:10:45 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C Received: from satlexmb07.amd.com (165.204.84.17) by CH1PEPF0000AD7A.mail.protection.outlook.com (10.167.244.59) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9891.9 via Frontend Transport; Mon, 4 May 2026 06:10:44 +0000 Received: from BLR-L-BHARARAO.amd.com (10.180.168.240) by satlexmb07.amd.com (10.181.42.216) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Mon, 4 May 2026 01:10:36 -0500 From: Bharata B Rao To: , CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , , Subject: [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Date: Mon, 4 May 2026 11:39:22 +0530 Message-ID: <20260504060924.344313-6-bharata@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260504060924.344313-1-bharata@amd.com> References: <20260504060924.344313-1-bharata@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: satlexmb08.amd.com (10.181.42.217) To satlexmb07.amd.com (10.181.42.216) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: CH1PEPF0000AD7A:EE_|IA1PR12MB6042:EE_ X-MS-Office365-Filtering-Correlation-Id: 
868e5b36-7f38-4116-d857-08dea9a3e0d3 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|1800799024|82310400026|7416014|376014|36860700016|18002099003|22082099003|56012099003; X-Microsoft-Antispam-Message-Info: GYigqpm/kBdsojbVXNz1mcuo+Ka0cvgw8iBNhwXx2DvlPu4AgIbpMMxgLaOQa02iVR2pMPNIx2vaOAw1WnL0Vg8vNArj5sMxUw1Gx/ZHLF7IgazDmtH0aVKgwkz3LQV+QG38WSZf9uIWFBR4KZhXpGeHRrilw26mawP/xXaaKwod85Tn1ubc0IPl9FsTtZHf7gpU5wZjKP0BhMwqkJuz3lI9+Mss5TvpffKkaIfLbNIc/pIJ2rdnPups2mzdBge5kTU5cxoTsSoBeRm6N0I2LNe6BSViV/nxBDbtNU7kUQDFHs2lU79mPguL6ogOcf5ubFnjwUJ3+T69th+/bHGPTE3X0aK0eYN20DWUdG+94bVmRrmWiEaknN4C/9qAWR35Sj7fbIac76NO/IyU579fJNSFPImzfe6jz0P6xqI0lJpT7byU8LRwdd5JCkMsCQaWlSx0xDy0RFyUY8XH4WKD75nqvLDZKMuaDUEtYzBOBR0lAvUlr3TZz8yV1EMN5b3+n0c+u8HSDSvOmIzH0aTUL2HqQyXgcVUIXBL4lLXj4KMZSDV8y6IhrYoBNFFw/QI8qYyyJ+GcfyCNYmMHvlyJ/J1R3pbAlPYOl8bZRcAJTIabBU1WmzWabl/kMpgnr7dPkxARLSglGBaEf1WBWjRr1PbFJeEZQ+npvOZGc6crFKlIV9ONcnh/jaz416SalOjqbqqeZ6l6ALnrZ1jzqnvKTilrDr6YQrPqJkHaUwlOogQ= X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(1800799024)(82310400026)(7416014)(376014)(36860700016)(18002099003)(22082099003)(56012099003);DIR:OUT;SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: VsQTELu00lQ0s47v5u4PbuWaRBEPe27iDcN+FCpfUd12QS0RXO6EC76hlmFxRPp69vaAQXzHzNcGkWtNq0qmbXCt923WlX8NeFiCGG5v0DjlPoEFylgK2CNpTqR02gzH7tcjgc5DFY1hgEq2A+h1GXpkx1KyEwso2coLv4h6XBjtbO5hsC3VqJSC4D21cWqNlv+5fd0KedYHFdDj9DHwuKYbhMnHD35rPZ8PqrCzQsftUrY4NT7em0qqy4JalZ+r5q95fdt3/Whn7KVvgQJk5AoXZAZXnzRDKvq50LlvYTnLtfZhCfid3KaqoRfVXE307QLIkTgG9wCNO1xTAr8nouBSnODwdj263o55vLqPSdmq/xXG2WpTy8nPfO354+MGlx2/vne2UYG8FpwW2zjhOvTH1JASyszHoJGl6EJYCluKg/H5tbFIuKKznJjoSejK X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 04 May 2026 06:10:44.9350 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 
868e5b36-7f38-4116-d857-08dea9a3e0d3 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com] X-MS-Exchange-CrossTenant-AuthSource: CH1PEPF0000AD7A.namprd04.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: IA1PR12MB6042 Content-Type: text/plain; charset="utf-8"

Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING mode of NUMA Balancing) does hot page detection (via hint faults), hot page classification and eventual promotion, all by itself and sits within the scheduler.

With pghot, the new hot page tracking and promotion mechanism, now available, NUMA Balancing can limit itself to detection of hot pages (via hint faults) and offload the rest of the functionality to pghot.

To achieve this, the pghot_record_access(PGHOT_HINTFAULTS) API is used to feed the hot page info to pghot. In addition, the migration rate limiting and dynamic threshold logic are moved to kmigrated so that the same can be used for hot pages reported by other sources too. Hence it becomes necessary to introduce a new config option, CONFIG_NUMA_BALANCING_TIERING, to control the hint faults source for hot page promotion. This option controls the NUMA_BALANCING_MEMORY_TIERING mode of kernel.numa_balancing.

This movement of hot page promotion to pghot results in the following changes to the behaviour of hint-fault-based hot page promotion:

1. Promotion is no longer done in the fault path but instead is deferred to kmigrated and happens in batches.
2. NUMA_BALANCING_MEMORY_TIERING mode used to promote on first access. pghot, by default, promotes on the second access, though this can be changed by setting /sys/kernel/debug/pghot/freq_threshold. The hot_threshold_ms debugfs tunable is now replaced by pghot's freq_threshold.
3.
In NUMA_BALANCING_MEMORY_TIERING mode, hint fault latency is the difference between the PTE update time (during scanning) and the access time (hint fault). With pghot, however, a single latency threshold is used for two purposes:
   a) If the time difference between successive accesses is within the threshold, the page is marked as hot.
   b) Later, when kmigrated picks up the page for migration, it migrates the page only if the difference between the current time and the time when the page was marked hot is within the threshold.
4. Batch migration of misplaced folios is done from non-process context, where VMA info is not readily available. Without the VMA and the exec check on it, exec pages cannot be filtered out during the migration prep stage. Hence shared executable pages are also subject to misplaced migration.
5. The max scan period, which is used in the dynamic threshold logic, was a debugfs tunable. It has been converted to a scalar metric in pghot.
6. In the uncommon case of using NUMA_BALANCING_NORMAL mode to balance between lower and higher tier nodes, we end up waking kswapd when there is no headroom in the toptier.

Key code changes due to this movement are detailed below to make the restructuring easier to follow.

1. Scanning and access times are no longer tracked in the last_cpupid field of folio flags. Hence all code related to this (like folio_xchg_access_time(), cpupid_valid()) is removed.
2. The misplaced migration routines become conditional on CONFIG_PGHOT in addition to CONFIG_NUMA_BALANCING.
3. The promotion-related stats (like PGPROMOTE_SUCCESS etc.) are now moved under CONFIG_PGHOT, as these stats are part of the promotion engine, which will be used for other hotness sources as well.
4. Routines that are responsible for migration rate limiting, dynamic thresholding, pgdat balancing during promotion etc. are moved to pghot with appropriate renaming.
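The two uses of the single latency threshold described in behaviour change 3 above can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel code: struct hot_record, record_access() and still_hot() are hypothetical names, time is kept in plain milliseconds rather than truncated jiffies, the frequency counter is not capped, and the constants simply mirror the documented defaults (3000 ms window, freq_threshold of 2).

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel symbols. */
struct hot_record {
	unsigned int last_access_ms;	/* time of the most recent access */
	unsigned int freq;		/* accesses counted in the current window */
	bool migrate_ready;		/* set once freq reaches the threshold */
};

static const unsigned int window_ms = 3000;	/* assumed: default window */
static const unsigned int freq_threshold = 2;	/* assumed: default threshold */

/*
 * Use (a): on each access, count it if it falls within the window of the
 * previous access, otherwise start a new window. Returns true when the
 * page becomes (or stays) ready for migration.
 */
static bool record_access(struct hot_record *r, unsigned int now_ms)
{
	if (now_ms - r->last_access_ms > window_ms)
		r->freq = 1;
	else
		r->freq++;

	r->last_access_ms = now_ms;
	r->migrate_ready = r->freq >= freq_threshold;
	return r->migrate_ready;
}

/*
 * Use (b): at migration time, the same window decides whether the page
 * is still recent enough to be worth migrating.
 */
static bool still_hot(const struct hot_record *r, unsigned int now_ms)
{
	return r->migrate_ready && now_ms - r->last_access_ms <= window_ms;
}
```

With these assumed defaults, accesses at 1000 ms and 2500 ms mark the page ready; a migration pass at 4000 ms would still treat it as hot, while a pass at 6000 ms would skip it because more than one window has elapsed since the last recorded access.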
Signed-off-by: Bharata B Rao --- include/linux/mm.h | 35 ++------ include/linux/mmzone.h | 4 +- init/Kconfig | 13 +++ kernel/sched/core.c | 7 ++ kernel/sched/debug.c | 1 - kernel/sched/fair.c | 177 ++--------------------------------------- kernel/sched/sched.h | 1 - mm/huge_memory.c | 24 +++++- mm/memcontrol.c | 6 +- mm/memory-tiers.c | 15 ++-- mm/memory.c | 28 +++++-- mm/mempolicy.c | 3 - mm/migrate.c | 16 +++- mm/pghot.c | 134 +++++++++++++++++++++++++++++++ mm/vmstat.c | 2 +- 15 files changed, 239 insertions(+), 227 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0b776907152e..3b237946b322 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2271,17 +2271,6 @@ static inline int folio_nid(const struct folio *foli= o) } =20 #ifdef CONFIG_NUMA_BALANCING -/* page access time bits needs to hold at least 4 seconds */ -#define PAGE_ACCESS_TIME_MIN_BITS 12 -#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS -#define PAGE_ACCESS_TIME_BUCKETS \ - (PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT) -#else -#define PAGE_ACCESS_TIME_BUCKETS 0 -#endif - -#define PAGE_ACCESS_TIME_MASK \ - (LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS) =20 static inline int cpu_pid_to_cpupid(int cpu, int pid) { @@ -2347,15 +2336,6 @@ static inline void page_cpupid_reset_last(struct pag= e *page) } #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */ =20 -static inline int folio_xchg_access_time(struct folio *folio, int time) -{ - int last_time; - - last_time =3D folio_xchg_last_cpupid(folio, - time >> PAGE_ACCESS_TIME_BUCKETS); - return last_time << PAGE_ACCESS_TIME_BUCKETS; -} - static inline void vma_set_access_pid_bit(struct vm_area_struct *vma) { unsigned int pid_bit; @@ -2366,18 +2346,12 @@ static inline void vma_set_access_pid_bit(struct vm= _area_struct *vma) } } =20 -bool folio_use_access_time(struct folio *folio); #else /* !CONFIG_NUMA_BALANCING */ static inline int folio_xchg_last_cpupid(struct folio *folio, int cpupid) { return folio_nid(folio); /* XXX */ } =20 
-static inline int folio_xchg_access_time(struct folio *folio, int time) -{ - return 0; -} - static inline int folio_last_cpupid(struct folio *folio) { return folio_nid(folio); /* XXX */ @@ -2420,11 +2394,16 @@ static inline bool cpupid_match_pid(struct task_str= uct *task, int cpupid) static inline void vma_set_access_pid_bit(struct vm_area_struct *vma) { } -static inline bool folio_use_access_time(struct folio *folio) +#endif /* CONFIG_NUMA_BALANCING */ + +#ifdef CONFIG_NUMA_BALANCING_TIERING +bool folio_is_promo_candidate(struct folio *folio); +#else +static inline bool folio_is_promo_candidate(struct folio *folio) { return false; } -#endif /* CONFIG_NUMA_BALANCING */ +#endif /* CONFIG_NUMA_BALANCING_TIERING */ =20 #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) =20 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 9577bdc575d9..b29d06168826 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -287,7 +287,7 @@ enum node_stat_item { #ifdef CONFIG_SWAP NR_SWAPCACHE, #endif -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT PGPROMOTE_SUCCESS, /* promote successfully */ /** * Candidate pages for promotion based on hint fault latency. This @@ -1566,7 +1566,7 @@ typedef struct pglist_data { struct deferred_split deferred_split_queue; #endif =20 -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT /* start time in ms of current promote rate limit period */ unsigned int nbp_rl_start; /* number of promote candidate pages at start time of current rate limit = period */ diff --git a/init/Kconfig b/init/Kconfig index 2937c4d308ae..7624be1c739a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1027,6 +1027,19 @@ config NUMA_BALANCING_DEFAULT_ENABLED If set, automatic NUMA balancing will be enabled if running on a NUMA machine. =20 +config NUMA_BALANCING_TIERING + bool "NUMA balancing memory tiering promotion" + depends on NUMA_BALANCING && PGHOT + help + Enable NUMA balancing mode 2 (memory tiering). 
This allows + automatic promotion of hot pages from slower memory tiers to + faster tiers using the pghot subsystem. + + This requires CONFIG_PGHOT for the hot page tracking engine. + This option is required for kernel.numa_balancing=3D2. + + If unsure, say N. + config SLAB_OBJ_EXT bool =20 diff --git a/kernel/sched/core.c b/kernel/sched/core.c index da20fb6ea25a..46ce75f00b40 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4523,6 +4523,7 @@ void set_numabalancing_state(bool enabled) } =20 #ifdef CONFIG_PROC_SYSCTL +#ifdef CONFIG_NUMA_BALANCING_TIERING static void reset_memory_tiering(void) { struct pglist_data *pgdat; @@ -4533,6 +4534,7 @@ static void reset_memory_tiering(void) pgdat->nbp_th_start =3D jiffies_to_msecs(jiffies); } } +#endif =20 static int sysctl_numa_balancing(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -4550,9 +4552,14 @@ static int sysctl_numa_balancing(const struct ctl_ta= ble *table, int write, if (err < 0) return err; if (write) { + if ((state & NUMA_BALANCING_MEMORY_TIERING) && + !IS_ENABLED(CONFIG_NUMA_BALANCING_TIERING)) + return -EOPNOTSUPP; +#ifdef CONFIG_NUMA_BALANCING_TIERING if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && (state & NUMA_BALANCING_MEMORY_TIERING)) reset_memory_tiering(); +#endif sysctl_numa_balancing_mode =3D state; __set_numabalancing_state(state); } diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 74c1617cf652..abf53f3071ea 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -623,7 +623,6 @@ static __init int sched_init_debug(void) debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balanci= ng_scan_period_min); debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balanci= ng_scan_period_max); debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_sca= n_size); - debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing= _hot_threshold); #endif /* 
CONFIG_NUMA_BALANCING */ =20 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 69361c63353a..f1da4fa95598 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu) static unsigned int sysctl_sched_cfs_bandwidth_slice =3D 5000UL; #endif =20 -#ifdef CONFIG_NUMA_BALANCING -/* Restrict the NUMA promotion throughput (MB/s) for each target node. */ -static unsigned int sysctl_numa_balancing_promote_rate_limit =3D 65536; -#endif - #ifdef CONFIG_SYSCTL static const struct ctl_table sched_fair_sysctls[] =3D { #ifdef CONFIG_CFS_BANDWIDTH @@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] =3D= { .extra1 =3D SYSCTL_ONE, }, #endif -#ifdef CONFIG_NUMA_BALANCING - { - .procname =3D "numa_balancing_promote_rate_limit_MBps", - .data =3D &sysctl_numa_balancing_promote_rate_limit, - .maxlen =3D sizeof(unsigned int), - .mode =3D 0644, - .proc_handler =3D proc_dointvec_minmax, - .extra1 =3D SYSCTL_ZERO, - }, -#endif /* CONFIG_NUMA_BALANCING */ }; =20 static int __init sched_fair_sysctl_init(void) @@ -1612,9 +1597,6 @@ unsigned int sysctl_numa_balancing_scan_size =3D 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in m= s */ unsigned int sysctl_numa_balancing_scan_delay =3D 1000; =20 -/* The page with hint page fault latency < threshold in ms is considered h= ot */ -unsigned int sysctl_numa_balancing_hot_threshold =3D MSEC_PER_SEC; - struct numa_group { refcount_t refcount; =20 @@ -1957,120 +1939,6 @@ static inline unsigned long group_weight(struct tas= k_struct *p, int nid, return 1000 * faults / total_faults; } =20 -/* - * If memory tiering mode is enabled, cpupid of slow memory page is - * used to record scan time instead of CPU and PID. When tiering mode - * is disabled at run time, the scan time (in cpupid) will be - * interpreted as CPU and PID. 
So CPU needs to be checked to avoid to - * access out of array bound. - */ -static inline bool cpupid_valid(int cpupid) -{ - return cpupid_to_cpu(cpupid) < nr_cpu_ids; -} - -/* - * For memory tiering mode, if there are enough free pages (more than - * enough watermark defined here) in fast memory node, to take full - * advantage of fast memory capacity, all recently accessed slow - * memory pages will be migrated to fast memory node without - * considering hot threshold. - */ -static bool pgdat_free_space_enough(struct pglist_data *pgdat) -{ - int z; - unsigned long enough_wmark; - - enough_wmark =3D max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, - pgdat->node_present_pages >> 4); - for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { - struct zone *zone =3D pgdat->node_zones + z; - - if (!populated_zone(zone)) - continue; - - if (zone_watermark_ok(zone, 0, - promo_wmark_pages(zone) + enough_wmark, - ZONE_MOVABLE, 0)) - return true; - } - return false; -} - -/* - * For memory tiering mode, when page tables are scanned, the scan - * time will be recorded in struct page in addition to make page - * PROT_NONE for slow memory page. So when the page is accessed, in - * hint page fault handler, the hint page fault latency is calculated - * via, - * - * hint page fault latency =3D hint page fault time - scan time - * - * The smaller the hint page fault latency, the higher the possibility - * for the page to be hot. - */ -static int numa_hint_fault_latency(struct folio *folio) -{ - int last_time, time; - - time =3D jiffies_to_msecs(jiffies); - last_time =3D folio_xchg_access_time(folio, time); - - return (time - last_time) & PAGE_ACCESS_TIME_MASK; -} - -/* - * For memory tiering mode, too high promotion/demotion throughput may - * hurt application latency. So we provide a mechanism to rate limit - * the number of pages that are tried to be promoted. 
- */ -static bool numa_promotion_rate_limit(struct pglist_data *pgdat, - unsigned long rate_limit, int nr) -{ - unsigned long nr_cand; - unsigned int now, start; - - now =3D jiffies_to_msecs(jiffies); - mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr); - nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); - start =3D pgdat->nbp_rl_start; - if (now - start > MSEC_PER_SEC && - cmpxchg(&pgdat->nbp_rl_start, start, now) =3D=3D start) - pgdat->nbp_rl_nr_cand =3D nr_cand; - if (nr_cand - pgdat->nbp_rl_nr_cand >=3D rate_limit) - return true; - return false; -} - -#define NUMA_MIGRATION_ADJUST_STEPS 16 - -static void numa_promotion_adjust_threshold(struct pglist_data *pgdat, - unsigned long rate_limit, - unsigned int ref_th) -{ - unsigned int now, start, th_period, unit_th, th; - unsigned long nr_cand, ref_cand, diff_cand; - - now =3D jiffies_to_msecs(jiffies); - th_period =3D sysctl_numa_balancing_scan_period_max; - start =3D pgdat->nbp_th_start; - if (now - start > th_period && - cmpxchg(&pgdat->nbp_th_start, start, now) =3D=3D start) { - ref_cand =3D rate_limit * - sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC; - nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); - diff_cand =3D nr_cand - pgdat->nbp_th_nr_cand; - unit_th =3D ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS; - th =3D pgdat->nbp_threshold ? : ref_th; - if (diff_cand > ref_cand * 11 / 10) - th =3D max(th - unit_th, unit_th); - else if (diff_cand < ref_cand * 9 / 10) - th =3D min(th + unit_th, ref_th * 2); - pgdat->nbp_th_nr_cand =3D nr_cand; - pgdat->nbp_threshold =3D th; - } -} - bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, int src_nid, int dst_cpu) { @@ -2086,41 +1954,15 @@ bool should_numa_migrate_memory(struct task_struct = *p, struct folio *folio, =20 /* * The pages in slow memory node should be migrated according - * to hot/cold instead of private/shared. 
- */ - if (folio_use_access_time(folio)) { - struct pglist_data *pgdat; - unsigned long rate_limit; - unsigned int latency, th, def_th; - long nr =3D folio_nr_pages(folio); - - pgdat =3D NODE_DATA(dst_nid); - if (pgdat_free_space_enough(pgdat)) { - /* workload changed, reset hot threshold */ - pgdat->nbp_threshold =3D 0; - mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr); - return true; - } - - def_th =3D sysctl_numa_balancing_hot_threshold; - rate_limit =3D MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit); - numa_promotion_adjust_threshold(pgdat, rate_limit, def_th); - - th =3D pgdat->nbp_threshold ? : def_th; - latency =3D numa_hint_fault_latency(folio); - if (latency >=3D th) - return false; - - return !numa_promotion_rate_limit(pgdat, rate_limit, nr); - } + * to hot/cold instead of private/shared. Also the migration + * of such pages are handled by kmigrated. + */ + if (folio_is_promo_candidate(folio)) + return true; =20 this_cpupid =3D cpu_pid_to_cpupid(dst_cpu, current->pid); last_cpupid =3D folio_xchg_last_cpupid(folio, this_cpupid); =20 - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && - !node_is_toptier(src_nid) && !cpupid_valid(last_cpupid)) - return false; - /* * Allow first faults or private faults to migrate immediately early in * the lifetime of a task. The magic number 4 is based on waiting for @@ -3330,15 +3172,6 @@ void task_numa_fault(int last_cpupid, int mem_node, = int pages, int flags) if (!p->mm) return; =20 - /* - * NUMA faults statistics are unnecessary for the slow memory - * node for memory tiering mode. 
- */ - if (!node_is_toptier(mem_node) && - (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING || - !cpupid_valid(last_cpupid))) - return; - /* Allocate buffer to track faults on a per-node basis */ if (unlikely(!p->numa_faults)) { int size =3D sizeof(*p->numa_faults) * diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9f63b15d309d..f176643516b5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3066,7 +3066,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; -extern unsigned int sysctl_numa_balancing_hot_threshold; =20 #ifdef CONFIG_SCHED_HRTICK =20 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 970e077019b7..1890b1e534a4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -40,6 +40,7 @@ #include #include #include +#include =20 #include #include "internal.h" @@ -2267,7 +2268,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) int nid =3D NUMA_NO_NODE; int target_nid, last_cpupid; pmd_t pmd, old_pmd; - bool writable =3D false; + bool writable =3D false, needs_promotion =3D false; int flags =3D 0; =20 vmf->ptl =3D pmd_lock(vma->vm_mm, vmf->pmd); @@ -2294,11 +2295,23 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *v= mf) goto out_map; =20 nid =3D folio_nid(folio); + needs_promotion =3D folio_is_promo_candidate(folio); =20 target_nid =3D numa_migrate_check(folio, vmf, haddr, &flags, writable, &last_cpupid); if (target_nid =3D=3D NUMA_NO_NODE) goto out_map; + + if (needs_promotion) { + /* + * Hot page promotion, mode=3DNUMA_BALANCING_MEMORY_TIERING. + * Isolation and migration are handled by pghot. 
+ */ + nid =3D target_nid; + goto out_map; + } + + /* Balancing b/n toptier nodes, mode=3DNUMA_BALANCING_NORMAL */ if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { flags |=3D TNF_MIGRATE_FAIL; goto out_map; @@ -2330,8 +2343,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vm= f) update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); spin_unlock(vmf->ptl); =20 - if (nid !=3D NUMA_NO_NODE) - task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags); + if (nid !=3D NUMA_NO_NODE) { + if (needs_promotion) + pghot_record_access(folio_pfn(folio), nid, + PGHOT_HINTFAULTS, jiffies); + else + task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags); + } return 0; } =20 diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c3d98ab41f1f..033b80ad248e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -400,7 +400,7 @@ static const unsigned int memcg_node_stat_items[] =3D { #ifdef CONFIG_SWAP NR_SWAPCACHE, #endif -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT PGPROMOTE_SUCCESS, #endif PGDEMOTE_KSWAPD, @@ -1594,7 +1594,7 @@ static const struct memory_stat memory_stats[] =3D { { "pgscan_khugepaged", PGSCAN_KHUGEPAGED }, { "pgscan_proactive", PGSCAN_PROACTIVE }, { "pgrefill", PGREFILL }, -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT { "pgpromote_success", PGPROMOTE_SUCCESS }, #endif }; @@ -1646,7 +1646,7 @@ static int memcg_page_state_output_unit(int item) case PGSCAN_KHUGEPAGED: case PGSCAN_PROACTIVE: case PGREFILL: -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT case PGPROMOTE_SUCCESS: #endif return 1; diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 54851d8a195b..be134a32f5bf 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -51,18 +51,19 @@ static const struct bus_type memory_tier_subsys =3D { .dev_name =3D "memory_tier", }; =20 -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_NUMA_BALANCING_TIERING /** - * folio_use_access_time - check if a folio reuses cpupid for page access = time + * folio_is_promo_candidate - check if the folio 
qualifies for promotion + * * @folio: folio to check * - * folio's _last_cpupid field is repurposed by memory tiering. In memory - * tiering mode, cpupid of slow memory folio (not toptier memory) is used = to - * record page access time. + * Checks if NUMA Balancing tiering mode is set and the folio belongs + * to lower tier. If so, it qualifies for promotion to toptier when + * it is categorized as hot. * - * Return: the folio _last_cpupid is used to record page access time + * Return: True if the above condition is met, else False. */ -bool folio_use_access_time(struct folio *folio) +bool folio_is_promo_candidate(struct folio *folio) { return (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && !node_is_toptier(folio_nid(folio)); diff --git a/mm/memory.c b/mm/memory.c index ea6568571131..17ea31750573 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -75,6 +75,7 @@ #include #include #include +#include #include #include #include @@ -6062,10 +6063,9 @@ int numa_migrate_check(struct folio *folio, struct v= m_fault *vmf, if (folio_maybe_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) *flags |=3D TNF_SHARED; /* - * For memory tiering mode, cpupid of slow memory page is used - * to record page access time. So use default value. + * For memory tiering mode, last_cpupid is unused. So use default value. 
*/ - if (folio_use_access_time(folio)) + if (folio_is_promo_candidate(folio)) *last_cpupid =3D (-1 & LAST_CPUPID_MASK); else *last_cpupid =3D folio_last_cpupid(folio); @@ -6146,6 +6146,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) int nid =3D NUMA_NO_NODE; bool writable =3D false, ignore_writable =3D false; bool pte_write_upgrade =3D vma_wants_manual_pte_write_upgrade(vma); + bool needs_promotion =3D false; int last_cpupid; int target_nid; pte_t pte, old_pte; @@ -6180,12 +6181,24 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) goto out_map; =20 nid =3D folio_nid(folio); + needs_promotion =3D folio_is_promo_candidate(folio); nr_pages =3D folio_nr_pages(folio); =20 target_nid =3D numa_migrate_check(folio, vmf, vmf->address, &flags, writable, &last_cpupid); if (target_nid =3D=3D NUMA_NO_NODE) goto out_map; + + if (needs_promotion) { + /* + * Hot page promotion, mode=3DNUMA_BALANCING_MEMORY_TIERING. + * Isolation and migration are handled by pghot. + */ + nid =3D target_nid; + goto out_map; + } + + /* Balancing b/n toptier nodes, mode=3DNUMA_BALANCING_NORMAL */ if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { flags |=3D TNF_MIGRATE_FAIL; goto out_map; @@ -6225,8 +6238,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) writable); pte_unmap_unlock(vmf->pte, vmf->ptl); =20 - if (nid !=3D NUMA_NO_NODE) - task_numa_fault(last_cpupid, nid, nr_pages, flags); + if (nid !=3D NUMA_NO_NODE) { + if (needs_promotion) + pghot_record_access(folio_pfn(folio), nid, + PGHOT_HINTFAULTS, jiffies); + else + task_numa_fault(last_cpupid, nid, nr_pages, flags); + } return 0; } =20 diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4e4421b22b59..aef9bb8a6cd4 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -872,9 +872,6 @@ bool folio_can_map_prot_numa(struct folio *folio, struc= t vm_area_struct *vma, node_is_toptier(nid)) return false; =20 - if (folio_use_access_time(folio)) - folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); - return true; 
} =20 diff --git a/mm/migrate.c b/mm/migrate.c index 726d27b61a46..a468fa4f7963 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2709,8 +2709,18 @@ int migrate_misplaced_folio_prepare(struct folio *fo= lio, if (!migrate_balanced_pgdat(pgdat, nr_pages)) { int z; =20 - if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)) + /* + * Kswapd wakeup for creating headroom in toptier is done only + * for hot page promotion case and not for misplaced migrations + * between toptier nodes. + * + * In the uncommon case of using NUMA_BALANCING_NORMAL mode + * to balance between lower and higher tier nodes, we end up + * waking the kswapd. + */ + if (node_is_toptier(folio_nid(folio))) return -EAGAIN; + for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { if (managed_zone(pgdat->node_zones + z)) break; @@ -2760,6 +2770,8 @@ int migrate_misplaced_folio(struct folio *folio, int = node) #ifdef CONFIG_NUMA_BALANCING count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); +#endif +#ifdef CONFIG_PGHOT if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && !node_is_toptier(folio_nid(folio)) && node_is_toptier(node)) { @@ -2824,6 +2836,8 @@ int promote_misplaced_memcg_folios(struct list_head *= folio_list, int node) #ifdef CONFIG_NUMA_BALANCING count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); +#endif +#ifdef CONFIG_PGHOT mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)), PGPROMOTE_SUCCESS, nr_succeeded); #endif diff --git a/mm/pghot.c b/mm/pghot.c index 0b31d5917833..1f204a8613eb 100644 --- a/mm/pghot.c +++ b/mm/pghot.c @@ -17,6 +17,9 @@ * the hot pages. kmigrated runs for each lower tier node. It iterates * over the node's PFNs and migrates pages marked for migration into * their targeted nodes. + * + * Migration rate-limiting and dynamic threshold logic implementations + * were moved from NUMA Balancing mode 2. 
*/ #include #include @@ -32,6 +35,12 @@ unsigned int kmigrated_batch_nr =3D KMIGRATED_DEFAULT_BA= TCH_NR; =20 unsigned int sysctl_pghot_freq_window =3D PGHOT_DEFAULT_FREQ_WINDOW; =20 +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */ +static unsigned int sysctl_pghot_promote_rate_limit =3D 65536; + +#define KMIGRATED_MIGRATION_ADJUST_STEPS 16 +#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW 60000 + DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints); DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults); =20 @@ -45,6 +54,22 @@ static const struct ctl_table pghot_sysctls[] =3D { .proc_handler =3D proc_dointvec_minmax, .extra1 =3D SYSCTL_ZERO, }, + { + .procname =3D "pghot_promote_rate_limit_MBps", + .data =3D &sysctl_pghot_promote_rate_limit, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, + { + .procname =3D "numa_balancing_promote_rate_limit_MBps", + .data =3D &sysctl_pghot_promote_rate_limit, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, }; #endif =20 @@ -146,6 +171,110 @@ int pghot_record_access(unsigned long pfn, int nid, i= nt src, unsigned long now) return 0; } =20 +/* + * For memory tiering mode, if there are enough free pages (more than + * enough watermark defined here) in fast memory node, to take full + * advantage of fast memory capacity, all recently accessed slow + * memory pages will be migrated to fast memory node without + * considering hot threshold. 
+ */ +static bool pgdat_free_space_enough(struct pglist_data *pgdat) +{ + int z; + unsigned long enough_wmark; + + enough_wmark =3D max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, + pgdat->node_present_pages >> 4); + for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { + struct zone *zone =3D pgdat->node_zones + z; + + if (!populated_zone(zone)) + continue; + + if (zone_watermark_ok(zone, 0, + promo_wmark_pages(zone) + enough_wmark, + ZONE_MOVABLE, 0)) + return true; + } + return false; +} + +/* + * For memory tiering mode, too high promotion/demotion throughput may + * hurt application latency. So we provide a mechanism to rate limit + * the number of pages that are tried to be promoted. + */ +static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsi= gned long rate_limit, + int nr, unsigned int now_ms) +{ + unsigned long nr_cand; + unsigned int start; + + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr); + nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); + start =3D pgdat->nbp_rl_start; + if (now_ms - start > MSEC_PER_SEC && + cmpxchg(&pgdat->nbp_rl_start, start, now_ms) =3D=3D start) + pgdat->nbp_rl_nr_cand =3D nr_cand; + if (nr_cand - pgdat->nbp_rl_nr_cand >=3D rate_limit) + return true; + return false; +} + +static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat, + unsigned long rate_limit, unsigned int ref_th, + unsigned int now_ms) +{ + unsigned int start, th_period, unit_th, th; + unsigned long nr_cand, ref_cand, diff_cand; + + th_period =3D KMIGRATED_PROMOTION_THRESHOLD_WINDOW; + start =3D pgdat->nbp_th_start; + if (now_ms - start > th_period && + cmpxchg(&pgdat->nbp_th_start, start, now_ms) =3D=3D start) { + ref_cand =3D rate_limit * + KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC; + nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); + diff_cand =3D nr_cand - pgdat->nbp_th_nr_cand; + unit_th =3D ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS; + th =3D pgdat->nbp_threshold ? 
: ref_th; + if (diff_cand > ref_cand * 11 / 10) + th =3D max(th - unit_th, unit_th); + else if (diff_cand < ref_cand * 9 / 10) + th =3D min(th + unit_th, ref_th * 2); + pgdat->nbp_th_nr_cand =3D nr_cand; + pgdat->nbp_threshold =3D th; + } +} + +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int ni= d, + unsigned long time) +{ + struct pglist_data *pgdat; + unsigned long rate_limit; + unsigned int th, def_th; + unsigned int now_ms =3D jiffies_to_msecs(jiffies); /* Based on full-width= jiffies */ + unsigned long now =3D jiffies; + + pgdat =3D NODE_DATA(nid); + if (pgdat_free_space_enough(pgdat)) { + /* workload changed, reset hot threshold */ + pgdat->nbp_threshold =3D 0; + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages); + return true; + } + + def_th =3D sysctl_pghot_freq_window; + rate_limit =3D MB_TO_PAGES(sysctl_pghot_promote_rate_limit); + kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms); + + th =3D pgdat->nbp_threshold ? : def_th; + if (pghot_access_latency(time, now) >=3D th) + return false; + + return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_m= s); +} + static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq, unsigned long *time) { @@ -223,6 +352,11 @@ static void kmigrated_walk_zone(unsigned long start_pf= n, unsigned long end_pfn, goto out_next; } =20 + if (!kmigrated_should_migrate_memory(nr, nid, time)) { + folio_put(folio); + goto out_next; + } + if (migrate_misplaced_folio_prepare(folio, NULL, nid)) { folio_put(folio); goto out_next; diff --git a/mm/vmstat.c b/mm/vmstat.c index 4064ead568cc..da668ff05032 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1268,7 +1268,7 @@ const char * const vmstat_text[] =3D { #ifdef CONFIG_SWAP [I(NR_SWAPCACHE)] =3D "nr_swapcached", #endif -#ifdef CONFIG_NUMA_BALANCING +#ifdef CONFIG_PGHOT [I(PGPROMOTE_SUCCESS)] =3D "pgpromote_success", [I(PGPROMOTE_CANDIDATE)] =3D "pgpromote_candidate", [I(PGPROMOTE_CANDIDATE_NRL)] =3D 
"pgpromote_candidate_nrl", --=20 2.34.1 From nobody Fri Jun 12 12:43:35 2026 Received: from SA9PR02CU001.outbound.protection.outlook.com (mail-southcentralusazon11013023.outbound.protection.outlook.com [40.93.196.23]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E6CD6292B2E for ; Mon, 4 May 2026 06:10:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=40.93.196.23 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777875060; cv=fail; b=jL7GqY5g1PMyFkzMmYXl85B2ygrFLLKeNnzoj28JINh1cJfeU/1EgXjCSD0THv1D6As1OkPvTv9lK++MLLI97pwlnY0ppYxKt4hOKH8VUaR+oRxqHF5jFfk89bh0l1fzNiglsEOeasRoTGqw/+JNJK9DOYqY8dCMilA4SbDftF8= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777875060; c=relaxed/simple; bh=h2jYFjdSQjsLEjSO04gSVih9jNkGL7hxogRvyoiFMXs=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=ayBRO4cNwy3XMRBvU9C6KMYKExTMN0pqxidAIZ2Rdeqw+z/nW2x1DCYwty+66d30X3WQWqSZ10E/HgxR272MlohRO59IQW5wDS1tZIDXb1/T3jFfS096zvk7yjOptF0EkHIByrQDZYDZRoSAPMEs8xc09E56WeljgxNKSI1+Lag= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=P+XkzWeB; arc=fail smtp.client-ip=40.93.196.23 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="P+XkzWeB" ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; 
b=Zk112YuWslE7EM5Nmn8irlt9UcV4+JyZKOzm7dvPG+RLAV+I3V48FGNKBGgKIYB/Xvc5K0AaV+y73fJTxc+OiMHlKO1xtyc6abPFsfHU2SGgD/ra5oSHzGSx29sqOhMeeLW6o+HsfMSpWcJOzUdhixGd84eii4Ffq2K5vvpsuRetjIFjQIDjaBgJ0gFQox/avaH6s0NDniRDhdH9lR/CKKbhU8thsHMJpJ5f7KBvr2zxhIF/hG5xXPLGTQJLEY00RlCI980P9EknU702BV/f4CeiV9dB/vGvWba5mzXdZfFMxfZgR3unUPldpCdnJRTLpFo7k4t1goiZNTXVICmSyw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=qYTG/OTUAyUKj0dzPu1bSgtDGQVrm42SWt0V2KXT838=; b=C99+eeoKeHsudtbFwGeras24uOZOdHfBbXG6OYI0hyNgQLJRq0vToCrTxX0e5j3vlZvUluqtU2zzd7AzgQOm2l6SyuX7EYa38GqHEy3d5FYBjRShokeYqxCwARwuTjkp4VRFonCDuBLq0uL7mJjpJZa91PRp4tRmNqxBTvaFgvodk+OZNdv+l5fEyNj+LhXnUxlieeHJnzWII4+CIZF+GPG6IlB+yK/eg9CMXszjC2EVpsZxRk3woOaE2vfHLQAajq5zmO0SdSpKweEM1/vvYoRKh0lkbb/p+d2RP2CVhmZ98B5w4LdtGOm9O0/5fPPGdDnPvvxMosbdthzKXdVULg== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=qYTG/OTUAyUKj0dzPu1bSgtDGQVrm42SWt0V2KXT838=; b=P+XkzWeB6T/CXBINExpiiwhBt9jaajh10QeO2Ntfqnzconut4y3Wa7Wo+JbNBTNUi3syVlsSWoTd+nheGk++0x55gfPP4z387PgICPYTzAyoVZi6UgzDvBS5tD3AtMPZMXCpltoKi5CqfbzJNCI0sDpHJHNqp/2SJvDxSntM1R4= Received: from CH0PR07CA0027.namprd07.prod.outlook.com (2603:10b6:610:32::32) by LV8PR12MB9136.namprd12.prod.outlook.com (2603:10b6:408:18e::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9870.25; Mon, 4 May 2026 06:10:52 +0000 Received: from 
From: Bharata B Rao
Subject: [RFC PATCH v7 6/7] x86/ibs: Move IBS caps definitions into its own header
Date: Mon, 4 May 2026 11:39:23 +0530
Message-ID: <20260504060924.344313-7-bharata@amd.com>
In-Reply-To: <20260504060924.344313-1-bharata@amd.com>
References: <20260504060924.344313-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

A subsequent patch adds the IBS Memory Profiler driver, which is
independent of the perf subsystem but needs the CPUID 0x8000001B
capability bits. Hence move those bit definitions out of
asm/perf_event.h into a dedicated header so the new driver can consume
them without pulling in perf.

Signed-off-by: Bharata B Rao
---
 arch/x86/include/asm/ibs-caps.h   | 85 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h | 81 +----------------------------
 2 files changed, 86 insertions(+), 80 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs-caps.h

diff --git a/arch/x86/include/asm/ibs-caps.h b/arch/x86/include/asm/ibs-caps.h
new file mode 100644
index 000000000000..ddf6c512c8f9
--- /dev/null
+++ b/arch/x86/include/asm/ibs-caps.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_CAPS_H
+#define _ASM_X86_IBS_CAPS_H
+
+/*
+ * IBS cpuid feature detection
+ */
+
+#define IBS_CPUID_FEATURES		0x8000001b
+
+/*
+ * Same bit mask as for IBS cpuid feature flags (Fn8000_001B_EAX), but
+ * bit 0 is used to indicate the existence of IBS.
+ */
+#define IBS_CAPS_AVAIL			(1U<<0)
+#define IBS_CAPS_FETCHSAM		(1U<<1)
+#define IBS_CAPS_OPSAM			(1U<<2)
+#define IBS_CAPS_RDWROPCNT		(1U<<3)
+#define IBS_CAPS_OPCNT			(1U<<4)
+#define IBS_CAPS_BRNTRGT		(1U<<5)
+#define IBS_CAPS_OPCNTEXT		(1U<<6)
+#define IBS_CAPS_RIPINVALIDCHK		(1U<<7)
+#define IBS_CAPS_OPBRNFUSE		(1U<<8)
+#define IBS_CAPS_FETCHCTLEXTD		(1U<<9)
+#define IBS_CAPS_OPDATA4		(1U<<10)
+#define IBS_CAPS_ZEN4			(1U<<11)
+#define IBS_CAPS_OPLDLAT		(1U<<12)
+#define IBS_CAPS_DIS			(1U<<13)
+#define IBS_CAPS_FETCHLAT		(1U<<14)
+#define IBS_CAPS_BIT63_FILTER		(1U<<15)
+#define IBS_CAPS_STRMST_RMTSOCKET	(1U<<16)
+#define IBS_CAPS_OPDTLBPGSIZE		(1U<<19)
+
+#define IBS_CAPS_DEFAULT		(IBS_CAPS_AVAIL		\
+					 | IBS_CAPS_FETCHSAM	\
+					 | IBS_CAPS_OPSAM)
+
+/*
+ * IBS APIC setup
+ */
+#define IBSCTL				0x1cc
+#define IBSCTL_LVT_OFFSET_VALID		(1ULL<<8)
+#define IBSCTL_LVT_OFFSET_MASK		0x0F
+
+/* IBS fetch bits/masks */
+#define IBS_FETCH_L3MISSONLY		(1ULL << 59)
+#define IBS_FETCH_RAND_EN		(1ULL << 57)
+#define IBS_FETCH_VAL			(1ULL << 49)
+#define IBS_FETCH_ENABLE		(1ULL << 48)
+#define IBS_FETCH_CNT			0xFFFF0000ULL
+#define IBS_FETCH_MAX_CNT		0x0000FFFFULL
+
+#define IBS_FETCH_2_DIS			(1ULL << 0)
+#define IBS_FETCH_2_FETCHLAT_FILTER	(0xFULL << 1)
+#define IBS_FETCH_2_FETCHLAT_FILTER_SHIFT (1)
+#define IBS_FETCH_2_EXCL_RIP_63_EQ_1	(1ULL << 5)
+#define IBS_FETCH_2_EXCL_RIP_63_EQ_0	(1ULL << 6)
+
+/*
+ * IBS op bits/masks
+ * The lower 7 bits of the current count are random bits
+ * preloaded by hardware and ignored in software
+ */
+#define IBS_OP_LDLAT_EN			(1ULL << 63)
+#define IBS_OP_LDLAT_THRSH		(0xFULL << 59)
+#define IBS_OP_LDLAT_THRSH_SHIFT	(59)
+#define IBS_OP_CUR_CNT			(0xFFF80ULL << 32)
+#define IBS_OP_CUR_CNT_RAND		(0x0007FULL << 32)
+#define IBS_OP_CUR_CNT_EXT_MASK		(0x7FULL << 52)
+#define IBS_OP_CNT_CTL			(1ULL << 19)
+#define IBS_OP_VAL			(1ULL << 18)
+#define IBS_OP_ENABLE			(1ULL << 17)
+#define IBS_OP_L3MISSONLY		(1ULL << 16)
+#define IBS_OP_MAX_CNT			0x0000FFFFULL
+#define IBS_OP_MAX_CNT_EXT		0x007FFFFFULL	/* not a register bit mask */
+#define IBS_OP_MAX_CNT_EXT_MASK		(0x7FULL << 20)	/* separate upper 7 bits */
+#define IBS_RIP_INVALID			(1ULL << 38)
+
+#define IBS_OP_2_DIS			(1ULL << 0)
+#define IBS_OP_2_EXCL_RIP_63_EQ_0	(1ULL << 1)
+#define IBS_OP_2_EXCL_RIP_63_EQ_1	(1ULL << 2)
+#define IBS_OP_2_STRM_ST_FILTER		(1ULL << 3)
+#define IBS_OP_2_STRM_ST_FILTER_SHIFT	(3)
+
+#endif /* _ASM_X86_IBS_CAPS_H */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 752cb319d5ea..655a54c77f4e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_PERF_EVENT_H
 
 #include
+#include
 
 /*
  * Performance event hw details:
@@ -620,86 +621,6 @@ struct arch_pebs_cntr_header {
  */
 #define EXT_PERFMON_DEBUG_FEATURES		0x80000022
 
-/*
- * IBS cpuid feature detection
- */
-
-#define IBS_CPUID_FEATURES		0x8000001b
-
-/*
- * Same bit mask as for IBS cpuid feature flags (Fn8000_001B_EAX), but
- * bit 0 is used to indicate the existence of IBS.
- */
-#define IBS_CAPS_AVAIL			(1U<<0)
-#define IBS_CAPS_FETCHSAM		(1U<<1)
-#define IBS_CAPS_OPSAM			(1U<<2)
-#define IBS_CAPS_RDWROPCNT		(1U<<3)
-#define IBS_CAPS_OPCNT			(1U<<4)
-#define IBS_CAPS_BRNTRGT		(1U<<5)
-#define IBS_CAPS_OPCNTEXT		(1U<<6)
-#define IBS_CAPS_RIPINVALIDCHK		(1U<<7)
-#define IBS_CAPS_OPBRNFUSE		(1U<<8)
-#define IBS_CAPS_FETCHCTLEXTD		(1U<<9)
-#define IBS_CAPS_OPDATA4		(1U<<10)
-#define IBS_CAPS_ZEN4			(1U<<11)
-#define IBS_CAPS_OPLDLAT		(1U<<12)
-#define IBS_CAPS_DIS			(1U<<13)
-#define IBS_CAPS_FETCHLAT		(1U<<14)
-#define IBS_CAPS_BIT63_FILTER		(1U<<15)
-#define IBS_CAPS_STRMST_RMTSOCKET	(1U<<16)
-#define IBS_CAPS_OPDTLBPGSIZE		(1U<<19)
-
-#define IBS_CAPS_DEFAULT		(IBS_CAPS_AVAIL		\
-					 | IBS_CAPS_FETCHSAM	\
-					 | IBS_CAPS_OPSAM)
-
-/*
- * IBS APIC setup
- */
-#define IBSCTL				0x1cc
-#define IBSCTL_LVT_OFFSET_VALID		(1ULL<<8)
-#define IBSCTL_LVT_OFFSET_MASK		0x0F
-
-/* IBS fetch bits/masks */
-#define IBS_FETCH_L3MISSONLY		(1ULL << 59)
-#define IBS_FETCH_RAND_EN		(1ULL << 57)
-#define IBS_FETCH_VAL			(1ULL << 49)
-#define IBS_FETCH_ENABLE		(1ULL << 48)
-#define IBS_FETCH_CNT			0xFFFF0000ULL
-#define IBS_FETCH_MAX_CNT		0x0000FFFFULL
-
-#define IBS_FETCH_2_DIS			(1ULL << 0)
-#define IBS_FETCH_2_FETCHLAT_FILTER	(0xFULL << 1)
-#define IBS_FETCH_2_FETCHLAT_FILTER_SHIFT (1)
-#define IBS_FETCH_2_EXCL_RIP_63_EQ_1	(1ULL << 5)
-#define IBS_FETCH_2_EXCL_RIP_63_EQ_0	(1ULL << 6)
-
-/*
- * IBS op bits/masks
- * The lower 7 bits of the current count are random bits
- * preloaded by hardware and ignored in software
- */
-#define IBS_OP_LDLAT_EN			(1ULL << 63)
-#define IBS_OP_LDLAT_THRSH		(0xFULL << 59)
-#define IBS_OP_LDLAT_THRSH_SHIFT	(59)
-#define IBS_OP_CUR_CNT			(0xFFF80ULL << 32)
-#define IBS_OP_CUR_CNT_RAND		(0x0007FULL << 32)
-#define IBS_OP_CUR_CNT_EXT_MASK		(0x7FULL << 52)
-#define IBS_OP_CNT_CTL			(1ULL << 19)
-#define IBS_OP_VAL			(1ULL << 18)
-#define IBS_OP_ENABLE			(1ULL << 17)
-#define IBS_OP_L3MISSONLY		(1ULL << 16)
-#define IBS_OP_MAX_CNT			0x0000FFFFULL
-#define IBS_OP_MAX_CNT_EXT		0x007FFFFFULL	/* not a register bit mask */
-#define IBS_OP_MAX_CNT_EXT_MASK		(0x7FULL << 20)	/* separate upper 7 bits */
-#define IBS_RIP_INVALID			(1ULL << 38)
-
-#define IBS_OP_2_DIS			(1ULL << 0)
-#define IBS_OP_2_EXCL_RIP_63_EQ_0	(1ULL << 1)
-#define IBS_OP_2_EXCL_RIP_63_EQ_1	(1ULL << 2)
-#define IBS_OP_2_STRM_ST_FILTER		(1ULL << 3)
-#define IBS_OP_2_STRM_ST_FILTER_SHIFT	(3)
-
 #ifdef CONFIG_X86_LOCAL_APIC
 extern u32 get_ibs_caps(void);
 extern int forward_event_to_ibs(struct perf_event *event);
-- 
2.34.1

From nobody Fri Jun 12 12:43:35 2026
From: Bharata B Rao
Subject: [RFC PATCH v7 7/7] x86/mm/ibs: In-kernel driver for AMD IBS Memory Profiler
Date: Mon, 4 May 2026 11:39:24 +0530
Message-ID: <20260504060924.344313-8-bharata@amd.com>
In-Reply-To: <20260504060924.344313-1-bharata@amd.com>
References: <20260504060924.344313-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Use the IBS (Instruction Based Sampling) Memory Profiler feature present
in AMD Zen6 processors for memory access tracking. The access
information obtained from the IBS Memory Profiler is fed to the pghot
sub-system for further action using the
pghot_record_access(PGHOT_HWHINTS, ...) API.

The IBS Memory Profiler as a page hotness source is enabled by the new
config option AMD_IBS_MEMPROF, which selects HWMEM_PROFILER, and is also
gated by the existing pghot_src_hwhints static key set via debugfs.

More details about the IBS Memory Profiler can be obtained from the AMD
document titled "AMD64 Zen6 Instruction Based Sampling (IBS) Extensions
and Features".
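
As an illustration of how samples can be staged between the NMI handler
and pghot, here is a minimal user-space sketch of a
single-producer/single-consumer ring of the kind the driver keeps per
CPU. It is not the driver code: struct sample_ring, ring_push() and
ring_pop() are hypothetical names, and C11 atomics stand in for the
kernel's smp_store_release()/smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SAMPLES 150	/* same size as IBS_NR_SAMPLES in the patch */

static_assert(NR_SAMPLES > 1, "ring needs at least two slots");

struct sample {
	unsigned long pfn;
	unsigned long time;
	int nid;
};

/* One producer (the NMI handler) and one consumer (the work item). */
struct sample_ring {
	struct sample samples[NR_SAMPLES];
	atomic_int head;	/* next slot to write, advanced by the producer */
	atomic_int tail;	/* next slot to read, advanced by the consumer */
};

/* Returns false when the ring is full; one slot is always left unused. */
static bool ring_push(struct sample_ring *r, struct sample s)
{
	int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	int tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	int next = (head + 1 >= NR_SAMPLES) ? 0 : head + 1;

	if (next == tail)
		return false;

	r->samples[head] = s;
	/* Publish the slot only after its contents are written. */
	atomic_store_explicit(&r->head, next, memory_order_release);
	return true;
}

/* Returns false when the ring is empty. */
static bool ring_pop(struct sample_ring *r, struct sample *out)
{
	int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	int head = atomic_load_explicit(&r->head, memory_order_acquire);
	int next = (tail + 1 >= NR_SAMPLES) ? 0 : tail + 1;

	if (head == tail)
		return false;

	*out = r->samples[tail];
	/* Hand the slot back to the producer after copying it out. */
	atomic_store_explicit(&r->tail, next, memory_order_release);
	return true;
}
```

Because one slot is kept empty so that head == tail always means
"empty", a ring of 150 slots holds at most 149 samples, and a full ring
drops the new sample instead of overwriting an old one.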

Signed-off-by: Bharata B Rao
---
 arch/x86/Kconfig                 |  16 ++
 arch/x86/include/asm/ibs-caps.h  |   8 +
 arch/x86/include/asm/ibs-mprof.h |  46 +++++
 arch/x86/include/asm/msr-index.h |   8 +
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/ibs-mprof.c          | 308 +++++++++++++++++++++++++++++++
 include/linux/cpuhotplug.h       |   1 +
 include/linux/vm_event_item.h    |   6 +
 mm/Kconfig                       |   9 +
 mm/vmstat.c                      |   6 +
 10 files changed, 409 insertions(+)
 create mode 100644 arch/x86/include/asm/ibs-mprof.h
 create mode 100644 arch/x86/mm/ibs-mprof.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 99bb5217649a..f06c0c44ecce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1514,6 +1514,22 @@ config AMD_MEM_ENCRYPT
 	  This requires an AMD processor that supports Secure Memory
 	  Encryption (SME).
 
+config AMD_IBS_MEMPROF
+	bool "AMD IBS Memory Profiler"
+	depends on X86_64 && CPU_SUP_AMD
+	depends on PGHOT
+	select HWMEM_PROFILER
+	help
+	  Use the AMD Instruction Based Sampling (IBS) Memory Profiler
+	  facility (present on Zen6 and later AMD CPUs) to feed
+	  hardware-observed memory accesses into the pghot subsystem
+	  for hot-page detection and promotion.
+
+	  When disabled, no IBS Memory Profiler MSRs are programmed and
+	  the corresponding NMI handler is not installed.
+
+	  If unsure, say N.
+
 # Common NUMA Features
 config NUMA
 	bool "NUMA Memory Allocation and Scheduler Support"
diff --git a/arch/x86/include/asm/ibs-caps.h b/arch/x86/include/asm/ibs-caps.h
index ddf6c512c8f9..1f6c4058a0e3 100644
--- a/arch/x86/include/asm/ibs-caps.h
+++ b/arch/x86/include/asm/ibs-caps.h
@@ -29,6 +29,7 @@
 #define IBS_CAPS_FETCHLAT		(1U<<14)
 #define IBS_CAPS_BIT63_FILTER		(1U<<15)
 #define IBS_CAPS_STRMST_RMTSOCKET	(1U<<16)
+#define IBS_CAPS_MEM_PROFILER		(1U<<18)
 #define IBS_CAPS_OPDTLBPGSIZE		(1U<<19)
 
 #define IBS_CAPS_DEFAULT		(IBS_CAPS_AVAIL		\
@@ -42,6 +43,13 @@
 #define IBSCTL_LVT_OFFSET_VALID		(1ULL<<8)
 #define IBSCTL_LVT_OFFSET_MASK		0x0F
 
+/*
+ * IBS Memprofiler setup
+ */
+#define IBSCTL_MPROF_LVT_OFFSET_VALID	(1ULL << 24)
+#define IBSCTL_MPROF_LVT_OFFSET_SHIFT	16
+#define IBSCTL_MPROF_LVT_OFFSET_MASK	(0xFULL << IBSCTL_MPROF_LVT_OFFSET_SHIFT)
+
 /* IBS fetch bits/masks */
 #define IBS_FETCH_L3MISSONLY		(1ULL << 59)
 #define IBS_FETCH_RAND_EN		(1ULL << 57)
diff --git a/arch/x86/include/asm/ibs-mprof.h b/arch/x86/include/asm/ibs-mprof.h
new file mode 100644
index 000000000000..91b1ce51d667
--- /dev/null
+++ b/arch/x86/include/asm/ibs-mprof.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_MPROF_H
+#define _ASM_X86_IBS_MPROF_H
+
+/*
+ * All bits are documented here for clarity even if the current
+ * driver doesn't use all of them.
+ */
+
+/* MSR_AMD64_IBS_MPROF_DATA2 bits */
+#define IBS_MPROF_DATA2_DATASRC_MASK		0x7
+#define IBS_MPROF_DATA2_DATASRC_MASK_HIGH	0xC0
+#define IBS_MPROF_DATA2_DATASRC_MASK_HIGH_SHIFT	0x3
+#define IBS_MPROF_DATA2_DATASRC_LCL_CCX		0x1
+#define IBS_MPROF_DATA2_DATASRC_PEER_CCX_NEAR	0x2
+#define IBS_MPROF_DATA2_DATASRC_DRAM		0x3
+#define IBS_MPROF_DATA2_DATASRC_CCX_FAR		0x5
+#define IBS_MPROF_DATA2_DATASRC_EXT_MEM		0x8
+#define IBS_MPROF_DATA2_RMT_NODE		BIT_ULL(4)
+#define IBS_MPROF_DATA2_RMT_SOCKET		BIT_ULL(9)
+
+/* MSR_AMD64_IBS_MPROF_DATA3 bits */
+#define IBS_MPROF_DATA3_LDOP			BIT_ULL(0)
+#define IBS_MPROF_DATA3_STOP			BIT_ULL(1)
+#define IBS_MPROF_DATA3_DCMISS			BIT_ULL(7)
+#define IBS_MPROF_DATA3_LADDR_VALID		BIT_ULL(17)
+#define IBS_MPROF_DATA3_PADDR_VALID		BIT_ULL(18)
+#define IBS_MPROF_DATA3_L2MISS			BIT_ULL(20)
+#define IBS_MPROF_DATA3_SW_PREFETCH		BIT_ULL(21)
+
+/* MSR_AMD64_IBS_MPROF_CTL bits */
+#define IBS_MPROF_CTL_CNT_CTL			BIT_ULL(19)
+#define IBS_MPROF_CTL_VAL			BIT_ULL(18)
+#define IBS_MPROF_CTL_ENABLE			BIT_ULL(17)
+#define IBS_MPROF_CTL_L3MISSONLY		BIT_ULL(16)
+#define IBS_MPROF_CTL_MAXCNT_MASK		0x0000FFFFULL
+#define IBS_MPROF_CTL_MAXCNT_EXT_MASK		(0x7FULL << 20)	/* separate upper 7 bits */
+
+/* MSR_AMD64_IBS_MPROF_CTL2 bits */
+#define IBS_MPROF_CTL2_DISABLE			BIT_ULL(0)
+#define IBS_MPROF_CTL2_EXCLUDE_USER		BIT_ULL(1)
+#define IBS_MPROF_CTL2_EXCLUDE_KERNEL		BIT_ULL(2)
+
+#define IBS_MPROF_SAMPLE_PERIOD			10000
+
+#endif /* _ASM_X86_IBS_MPROF_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a14a0f43e04a..c44b68940f43 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1315,4 +1315,12 @@
  * a #GP
  */
 
+/* AMD IBS Memory Profiler MSRs */
+#define MSR_AMD64_IBS_MPROF_CTL		0xc0010380
+#define MSR_AMD64_IBS_MPROF_CTL2	0xc0010381
+#define MSR_AMD64_IBS_MPROF_DATA2	0xc0010382
+#define MSR_AMD64_IBS_MPROF_DATA3	0xc0010383
+#define MSR_AMD64_IBS_MPROF_LINADDR	0xc0010384
+#define MSR_AMD64_IBS_MPROF_PHYADDR	0xc0010385
+
 #endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 3a5364853eab..050a7379d9f7 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_X86_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_amd.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+obj-$(CONFIG_AMD_IBS_MEMPROF)	+= ibs-mprof.o
diff --git a/arch/x86/mm/ibs-mprof.c b/arch/x86/mm/ibs-mprof.c
new file mode 100644
index 000000000000..b3d59b21c8c9
--- /dev/null
+++ b/arch/x86/mm/ibs-mprof.c
@@ -0,0 +1,308 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "amd_ibs_memprof: " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+#include
+
+#define IBS_NR_SAMPLES	150	/* Percpu sample buffer size */
+
+static DEFINE_PER_CPU(bool, mprof_work_pending);
+
+/*
+ * Basic access info captured for each memory access.
+ */
+struct mprof_sample {
+	unsigned long pfn;
+	unsigned long time;	/* jiffies when accessed */
+	int nid;		/* Accessing node ID, if known */
+};
+
+/*
+ * Percpu buffer of access samples. Samples are accumulated here
+ * before pushing them to pghot sub-system for further action.
+ */
+struct mprof_sample_pcpu {
+	struct mprof_sample samples[IBS_NR_SAMPLES];
+	int head, tail;
+};
+
+static struct mprof_sample_pcpu __percpu *mprof_s;
+
+/*
+ * The workqueue for pushing the percpu access samples to pghot sub-system.
+ */
+static DEFINE_PER_CPU(struct work_struct, mprof_work);
+static DEFINE_PER_CPU(struct irq_work, mprof_irq_work);
+
+/*
+ * Record the IBS-reported access sample in percpu buffer.
+ * Called from IBS NMI handler.
+ */
+static bool mprof_push_sample(unsigned long pfn, int nid, unsigned long time)
+{
+	struct mprof_sample_pcpu *pcpu = raw_cpu_ptr(mprof_s);
+	int head = READ_ONCE(pcpu->head);
+	int tail = READ_ONCE(pcpu->tail);
+	int next = head + 1;
+
+	if (next >= IBS_NR_SAMPLES)
+		next = 0;
+
+	if (next == tail)
+		return false;
+
+	pcpu->samples[head].pfn = pfn;
+	pcpu->samples[head].time = time;
+	pcpu->samples[head].nid = nid;
+
+	smp_store_release(&pcpu->head, next);
+	return true;
+}
+
+static bool mprof_pop_sample(struct mprof_sample *s)
+{
+	struct mprof_sample_pcpu *pcpu = raw_cpu_ptr(mprof_s);
+	int tail = READ_ONCE(pcpu->tail);
+	int head = smp_load_acquire(&pcpu->head);
+	int next = tail + 1;
+
+	if (head == tail)
+		return false;
+
+	if (next >= IBS_NR_SAMPLES)
+		next = 0;
+
+	*s = pcpu->samples[tail];
+
+	WRITE_ONCE(pcpu->tail, next);
+	return true;
+}
+
+/*
+ * Remove access samples from percpu buffer and send them
+ * to pghot sub-system for further action.
+ */
+static void mprof_work_handler(struct work_struct *work)
+{
+	struct mprof_sample s;
+
+	while (mprof_pop_sample(&s))
+		pghot_record_access(s.pfn, s.nid, PGHOT_HWHINTS, s.time);
+
+	this_cpu_write(mprof_work_pending, false);
+}
+
+static void mprof_irq_handler(struct irq_work *i)
+{
+	struct work_struct *w = this_cpu_ptr(&mprof_work);
+
+	/*
+	 * FIXME: pending samples on a CPU that goes offline before the
+	 * work runs may be lost or migrated to the wrong CPU's ring;
+	 * needs a teardown-time drain.
+	 */
+	schedule_work_on(smp_processor_id(), w);
+}
+
+/*
+ * L3MissOnly + Exclude kernel RIP
+ */
+static void mprof_enable_profiling(void)
+{
+	u64 mprof_config = IBS_MPROF_CTL_CNT_CTL | IBS_MPROF_CTL_ENABLE |
+			   IBS_MPROF_CTL_L3MISSONLY;
+	unsigned int period = IBS_MPROF_SAMPLE_PERIOD;
+	u64 ctl, ctl2;
+
+	/*
+	 * Assemble bits 26:20 and 19:4 of periodic op counter in ctl.
+	 * The lower 4 bits are always 0000b.
+	 */
+	ctl = (period >> 4) & IBS_MPROF_CTL_MAXCNT_MASK;
+	ctl |= (period & IBS_MPROF_CTL_MAXCNT_EXT_MASK);
+	ctl |= mprof_config;
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL, ctl);
+
+	/*
+	 * Exclude samples that have bit 63 of their RIP set.
+	 */
+	ctl2 = IBS_MPROF_CTL2_EXCLUDE_KERNEL;
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL2, ctl2);
+}
+
+static void mprof_disable_profiling(u64 mem_ctl)
+{
+	mem_ctl &= ~IBS_MPROF_CTL_ENABLE;
+	mem_ctl &= ~IBS_MPROF_CTL_VAL;
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+
+	wrmsrq(MSR_AMD64_IBS_MPROF_CTL2, IBS_MPROF_CTL2_DISABLE);
+}
+
+/*
+ * IBS NMI handler: Process the memory access info reported by IBS.
+ *
+ * Reads the MSRs to collect all the information about the reported
+ * memory access, validates the access, stores the valid sample and
+ * schedules the work on this CPU to further process the sample.
+ */
+static int mprof_overflow_handler(unsigned int cmd, struct pt_regs *regs)
+{
+	u64 mem_ctl, mem_data3, mem_data2, paddr, data_src;
+	unsigned long pfn;
+	struct page *page;
+
+	rdmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+	if (!(mem_ctl & IBS_MPROF_CTL_VAL))
+		return NMI_DONE;
+
+	mprof_disable_profiling(mem_ctl);
+	count_vm_event(HWHINT_TOTAL_EVENTS);
+
+	rdmsrq(MSR_AMD64_IBS_MPROF_DATA3, mem_data3);
+	rdmsrq(MSR_AMD64_IBS_MPROF_DATA2, mem_data2);
+
+	data_src = mem_data2 & IBS_MPROF_DATA2_DATASRC_MASK;
+	data_src |= ((mem_data2 & IBS_MPROF_DATA2_DATASRC_MASK_HIGH) >>
+		     IBS_MPROF_DATA2_DATASRC_MASK_HIGH_SHIFT);
+
+	switch (data_src) {
+	case IBS_MPROF_DATA2_DATASRC_DRAM:
+		count_vm_event(HWHINT_DRAM_ACCESSES);
+		break;
+	case IBS_MPROF_DATA2_DATASRC_EXT_MEM:
+		count_vm_event(HWHINT_EXTMEM_ACCESSES);
+		break;
+	}
+
+	/* Is linear addr valid? */
+	if (!(mem_data3 & IBS_MPROF_DATA3_LADDR_VALID))
+		goto handled;
+
+	/* Is phys addr valid? */
+	if (!(mem_data3 & IBS_MPROF_DATA3_PADDR_VALID))
+		goto handled;
+	rdmsrq(MSR_AMD64_IBS_MPROF_PHYADDR, paddr);
+
+	pfn = PHYS_PFN(paddr);
+	page = pfn_to_online_page(pfn);
+	if (!page)
+		goto handled;
+
+	/*
+	 * Use the accessing CPU's node as the migration target. On
+	 * topologies where all CPUs reside on toptier nodes (the common
+	 * case), this is the desired behaviour. Topologies that place
+	 * CPUs on lower-tier nodes are rejected later by
+	 * pghot_record_access() via the src_nid == nid early return.
+	 */
+	if (!mprof_push_sample(pfn, numa_node_id(), jiffies))
+		goto handled;
+
+	if (!this_cpu_read(mprof_work_pending)) {
+		this_cpu_write(mprof_work_pending, true);
+		irq_work_queue(this_cpu_ptr(&mprof_irq_work));
+	}
+	count_vm_event(HWHINT_USEFUL_EVENTS);
+
+handled:
+	mprof_enable_profiling();
+	return NMI_HANDLED;
+}
+
+static int get_mprof_lvt_offset(void)
+{
+	u64 val;
+
+	rdmsrq(MSR_AMD64_IBSCTL, val);
+	if (!(val & IBSCTL_MPROF_LVT_OFFSET_VALID))
+		return -EINVAL;
+
+	return (val & IBSCTL_MPROF_LVT_OFFSET_MASK) >>
+		IBSCTL_MPROF_LVT_OFFSET_SHIFT;
+}
+
+static int x86_amd_ibs_mprof_startup(unsigned int cpu)
+{
+	int offset = get_mprof_lvt_offset();
+
+	if (offset < 0) {
+		pr_warn("offset not valid on cpu #%d\n", cpu);
+		return 0;
+	}
+
+	if (setup_APIC_eilvt(offset, 0, APIC_DELIVERY_MODE_NMI, 0)) {
+		pr_warn("APIC setup failed on cpu #%d\n", cpu);
+		return 0;
+	}
+
+	mprof_enable_profiling();
+	return 0;
+}
+
+static int x86_amd_ibs_mprof_teardown(unsigned int cpu)
+{
+	int offset = get_mprof_lvt_offset();
+	u64 mem_ctl;
+
+	if (offset >= 0)
+		setup_APIC_eilvt(offset, 0, APIC_DELIVERY_MODE_FIXED, 1);
+
+	rdmsrq(MSR_AMD64_IBS_MPROF_CTL, mem_ctl);
+	mprof_disable_profiling(mem_ctl);
+
+	return 0;
+}
+
+static int __init mprof_access_profiling_init(void)
+{
+	u32 mprof_caps = cpuid_eax(IBS_CPUID_FEATURES);
+	int cpu, ret;
+
+	if (!(mprof_caps & IBS_CAPS_MEM_PROFILER)) {
+		pr_info("capability is unavailable for access profiling\n");
+		return 0;
+	}
+
+	mprof_s = alloc_percpu_gfp(struct mprof_sample_pcpu, GFP_KERNEL | __GFP_ZERO);
+	if (!mprof_s) {
+		pr_err("alloc_percpu_gfp failed\n");
+		return 0;
+	}
+
+	for_each_possible_cpu(cpu) {
+		INIT_WORK(per_cpu_ptr(&mprof_work, cpu), mprof_work_handler);
+		init_irq_work(per_cpu_ptr(&mprof_irq_work, cpu), mprof_irq_handler);
+	}
+
+	register_nmi_handler(NMI_LOCAL, mprof_overflow_handler, 0, "ibs-memprof");
+
+	ret = cpuhp_setup_state(CPUHP_AP_MM_AMD_IBS_MEMPROF_STARTING,
+				"x86/amd/ibs_mprof:starting",
+				x86_amd_ibs_mprof_startup,
+				x86_amd_ibs_mprof_teardown);
+
+	if (ret) {
+		unregister_nmi_handler(NMI_LOCAL, "ibs-memprof");
+		free_percpu(mprof_s);
+		pr_err("cpuhp_setup_state failed: %d\n", ret);
+	} else {
+		pr_info("IBS Memory Profiler setup for memory access profiling\n");
+	}
+	return 0;
+}
+
+device_initcall(mprof_access_profiling_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 22ba327ec227..feaa3f571726 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -150,6 +150,7 @@ enum cpuhp_state {
 	CPUHP_AP_PERF_X86_AMD_UNCORE_STARTING,
 	CPUHP_AP_PERF_X86_STARTING,
 	CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
+	CPUHP_AP_MM_AMD_IBS_MEMPROF_STARTING,
 	CPUHP_AP_PERF_XTENSA_STARTING,
 	CPUHP_AP_ARM_VFP_STARTING,
 	CPUHP_AP_ARM64_DEBUG_MONITORS_STARTING,
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 58d510711bd4..a9c04a9735c6 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -179,6 +179,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGHOT_RECORDED_ACCESSES,
 		PGHOT_RECORDED_HINTFAULTS,
 		PGHOT_RECORDED_HWHINTS,
+#ifdef CONFIG_HWMEM_PROFILER
+		HWHINT_TOTAL_EVENTS,
+		HWHINT_DRAM_ACCESSES,
+		HWHINT_EXTMEM_ACCESSES,
+		HWHINT_USEFUL_EVENTS,
+#endif /* CONFIG_HWMEM_PROFILER */
 #endif /* CONFIG_PGHOT */
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/Kconfig b/mm/Kconfig
index cc4b5685ecd4..674cfcea7bb0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1494,6 +1494,15 @@ config PGHOT_PRECISE
 	  4 bytes per page against the default one byte per page. Preferable
 	  to enable this on systems with multiple nodes in toptier.
 
+config HWMEM_PROFILER
+	bool
+	depends on PGHOT
+	help
+	  Umbrella symbol enabled by any in-kernel driver that forwards
+	  hardware-observed memory accesses to the pghot subsystem (for
+	  example AMD_IBS_MEMPROF on x86_64). Drivers select this; users
+	  do not enable it directly.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index da668ff05032..06e7ae06519e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1493,6 +1493,12 @@ const char * const vmstat_text[] = {
 	[I(PGHOT_RECORDED_ACCESSES)]		= "pghot_recorded_accesses",
 	[I(PGHOT_RECORDED_HINTFAULTS)]		= "pghot_recorded_hintfaults",
 	[I(PGHOT_RECORDED_HWHINTS)]		= "pghot_recorded_hwhints",
+#ifdef CONFIG_HWMEM_PROFILER
+	[I(HWHINT_TOTAL_EVENTS)]		= "hwhint_total_events",
+	[I(HWHINT_DRAM_ACCESSES)]		= "hwhint_dram_accesses",
+	[I(HWHINT_EXTMEM_ACCESSES)]		= "hwhint_extmem_accesses",
+	[I(HWHINT_USEFUL_EVENTS)]		= "hwhint_useful_events",
+#endif /* CONFIG_HWMEM_PROFILER */
 #endif /* CONFIG_PGHOT */
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
-- 
2.34.1
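
The MaxCnt encoding in mprof_enable_profiling() and the DataSrc folding
in mprof_overflow_handler() can be checked outside the kernel. The
sketch below copies the mask values from ibs-mprof.h in the patch and
assumes the register layout is as described in the patch comments. The
helper names mprof_ctl_value(), mprof_ctl_period() and mprof_data_src()
are made up for illustration and do not exist in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from ibs-mprof.h in the patch. */
#define CTL_CNT_CTL		(1ULL << 19)
#define CTL_ENABLE		(1ULL << 17)
#define CTL_L3MISSONLY		(1ULL << 16)
#define CTL_MAXCNT_MASK		0x0000FFFFULL
#define CTL_MAXCNT_EXT_MASK	(0x7FULL << 20)

#define DATA2_DATASRC_MASK		0x7
#define DATA2_DATASRC_MASK_HIGH		0xC0
#define DATA2_DATASRC_MASK_HIGH_SHIFT	3

/* Build the control value the same way mprof_enable_profiling() does. */
static uint64_t mprof_ctl_value(unsigned int period)
{
	uint64_t ctl;

	/* Only period bits 26:4 are representable; higher bits are lost. */
	assert(period < (1u << 27));

	ctl = (period >> 4) & CTL_MAXCNT_MASK;	/* period bits 19:4 */
	ctl |= period & CTL_MAXCNT_EXT_MASK;	/* period bits 26:20 */
	ctl |= CTL_CNT_CTL | CTL_ENABLE | CTL_L3MISSONLY;
	return ctl;
}

/* Recover the effective period encoded in a control value. */
static unsigned int mprof_ctl_period(uint64_t ctl)
{
	return (unsigned int)(((ctl & CTL_MAXCNT_MASK) << 4) |
			      (ctl & CTL_MAXCNT_EXT_MASK));
}

/* Fold DATA2 bits 7:6 down to bits 4:3, above the low DataSrc bits 2:0. */
static unsigned int mprof_data_src(uint64_t data2)
{
	uint64_t src = data2 & DATA2_DATASRC_MASK;

	src |= (data2 & DATA2_DATASRC_MASK_HIGH) >>
	       DATA2_DATASRC_MASK_HIGH_SHIFT;
	return (unsigned int)src;
}
```

With the default IBS_MPROF_SAMPLE_PERIOD of 10000 this yields a control
value of 0xB0271. A period that is not a multiple of 16 is rounded down,
since the lower four bits of the count are always zero.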