From nobody Sat Feb 7 08:44:11 2026
From: Bharata B Rao
Subject: [RFC PATCH v5 01/10] mm: migrate: Allow misplaced migration without VMA
Date: Thu, 29 Jan 2026 20:10:34 +0530
Message-ID: <20260129144043.231636-2-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260129144043.231636-1-bharata@amd.com>
References: <20260129144043.231636-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

We want isolation of misplaced folios to work in contexts where a VMA
isn't available, typically when performing migrations from a kernel
thread context. To prepare for that, allow
migrate_misplaced_folio_prepare() to be called with a NULL VMA.

When migrate_misplaced_folio_prepare() is called with a non-NULL VMA, it
checks whether the folio is mapped shared, which requires holding the
PTL. This path isn't taken when the function is invoked with a NULL VMA
(migration outside of process context).
Therefore, when VMA == NULL, migrate_misplaced_folio_prepare() does not
require the caller to hold the PTL.

Signed-off-by: Bharata B Rao
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..70f8f3ad4fd8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2652,7 +2652,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 
 /*
  * Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
  */
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -2669,7 +2670,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 	 * See folio_maybe_mapped_shared() on possible imprecision
 	 * when we cannot easily detect if a folio is shared.
 	 */
-	if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+	if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
 		return -EACCES;
 
 	/*
-- 
2.34.1

From nobody Sat Feb 7 08:44:11 2026
From: Bharata B Rao
Subject: [RFC PATCH v5 02/10] migrate: Add migrate_misplaced_folios_batch()
Date: Thu, 29 Jan 2026 20:10:35 +0530
Message-ID: <20260129144043.231636-3-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260129144043.231636-1-bharata@amd.com>
References: <20260129144043.231636-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

From: Gregory Price

Tiered memory systems often require migrating multiple folios at once.
Currently, migrate_misplaced_folio() handles only one folio per call,
which is inefficient for batch operations. This patch introduces
migrate_misplaced_folios_batch(), a batch variant that leverages
migrate_pages() internally for improved performance.

The caller must isolate folios beforehand using
migrate_misplaced_folio_prepare(). On return, the folio list will be
empty regardless of success or failure.

This function will be used by pghot kmigrated thread.

Signed-off-by: Gregory Price
[Rewrote commit description]
Signed-off-by: Bharata B Rao
---
 include/linux/migrate.h |  6 ++++++
 mm/migrate.c            | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 26ca00c325d9..f28326b88592 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -103,6 +103,7 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -113,6 +114,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folios_batch(struct list_head *folio_list,
+						 int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 70f8f3ad4fd8..4a3a9a4ff435 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2747,5 +2747,41 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/**
+ * migrate_misplaced_folios_batch() - Batch variant of migrate_misplaced_folio.
+ * Attempts to migrate a folio list to the specified destination.
+ * @folio_list: Isolated list of folios to be batch-migrated.
+ * @node: The NUMA node ID to where the folios should be migrated.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, drop the elevated reference
+ * and remove them from the list before returning.
+ *
+ * Return: 0 on success and -EAGAIN on failure or partial migration.
+ *         On return, @folio_list will be empty regardless of success/failure.
+ */
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	unsigned int nr_succeeded = 0;
+	int nr_remaining;
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+	WARN_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.34.1

From nobody Sat Feb 7 08:44:11 2026
From: Bharata B Rao
Subject: [RFC PATCH v5 03/10] mm: Hot page tracking and promotion
Date: Thu, 29 Jan 2026 20:10:36 +0530
Message-ID: <20260129144043.231636-4-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260129144043.231636-1-bharata@amd.com>
References: <20260129144043.231636-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

This introduces a subsystem for collecting memory access information from
different sources. It maintains the hotness information based on the access
history and time of access.

Additionally, it provides per-lower-tier-node kernel threads
(named kmigrated) that periodically promote the pages that are eligible
for promotion.

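As a rough illustration only, here is a user-space sketch of how hotness
could be derived from access history and access time. It is not the code
in this series: struct hot_rec, its fields and hot_rec_access() are
hypothetical names, and the record layout is an assumption. The window and
threshold parameters loosely mirror the pghot_promote_freq_window_ms and
freq_threshold tunables documented later in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page hotness record; NOT the layout used by pghot. */
struct hot_rec {
	uint64_t window_start_ms;	/* start of the current counting window */
	uint8_t freq;			/* accesses seen in the current window */
};

/*
 * Record one access at time now_ms (assumed to be monotonic).
 *
 * Accesses are counted within a window of window_ms. An access that
 * arrives after the window has expired starts a new window with a
 * count of 1. Returns true once the count reaches freq_threshold,
 * i.e. the page would be considered ready for promotion.
 */
static bool hot_rec_access(struct hot_rec *r, uint64_t now_ms,
			   uint64_t window_ms, uint8_t freq_threshold)
{
	if (r->freq == 0 || now_ms - r->window_start_ms > window_ms) {
		/* First access, or the previous window has expired. */
		r->window_start_ms = now_ms;
		r->freq = 1;
	} else if (r->freq < UINT8_MAX) {
		/* Saturating increment inside the current window. */
		r->freq++;
	}
	return r->freq >= freq_threshold;
}
```

With a 500 ms window and a threshold of 2, two accesses 100 ms apart would
mark the page as a promotion candidate, while a later access outside the
window would restart the count.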
Sub-systems that generate hot page access info can report that
using this API:

int pghot_record_access(unsigned long pfn, int nid, int src,
			unsigned long time)

@pfn: The PFN of the memory accessed
@nid: The accessing NUMA node ID
@src: The temperature source (subsystem) that generated the
      access info
@time: The access time in jiffies

Some temperature sources may not provide the nid from which
the page was accessed. This is true for sources that use
page table scanning for the PTE Accessed bit. For such sources,
a configurable/default toptier node is used as the promotion
target.

The hotness information is stored for every page of lower
tier memory in a u8 record (1 byte) in a per-section array
(hot_map) that hangs off the mem_section data structure.

kmigrated is a per-lower-tier-node kernel thread that migrates
the folios marked for migration in batches. Each kmigrated
thread walks the PFN range spanning its node and checks
for potential migration candidates.

Tunables for enabling individual hotness sources and for
setting target_nid and the frequency threshold are provided
in debugfs.

Signed-off-by: Bharata B Rao
---
 Documentation/admin-guide/mm/pghot.txt |  84 ++++++
 include/linux/mmzone.h                 |  21 ++
 include/linux/pghot.h                  |  94 +++++++
 include/linux/vm_event_item.h          |   6 +
 mm/Kconfig                             |  14 +
 mm/Makefile                            |   1 +
 mm/mm_init.c                           |  10 +
 mm/pghot-default.c                     |  73 +++++
 mm/pghot-tunables.c                    | 189 +++++++++++++
 mm/pghot.c                             | 370 +++++++++++++++++++++++++
 mm/vmstat.c                            |   6 +
 11 files changed, 868 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/pghot.txt
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot-default.c
 create mode 100644 mm/pghot-tunables.c
 create mode 100644 mm/pghot.c

diff --git a/Documentation/admin-guide/mm/pghot.txt b/Documentation/admin-guide/mm/pghot.txt
new file mode 100644
index 000000000000..01291b72e7ab
--- /dev/null
+++ b/Documentation/admin-guide/mm/pghot.txt
@@ -0,0 +1,84 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+PGHOT: Hot Page Tracking Tunables
+=================================
+
+Overview
+========
+The PGHOT subsystem tracks frequently accessed pages in lower-tier memory and
+promotes them to faster tiers. It uses per-PFN hotness metadata and asynchronous
+migration via per-node kernel threads (kmigrated).
+
+This document describes tunables available via **debugfs** and **sysctl** for
+PGHOT.
+
+Debugfs Interface
+=================
+Path: /sys/kernel/debug/pghot/
+
+1. **enabled_sources**
+   - Bitmask to enable/disable hotness sources.
+   - Bits:
+     - 0: Hardware hints (value 0x1)
+     - 1: Page table scan (value 0x2)
+     - 2: Hint faults (value 0x4)
+   - Default: 0 (disabled)
+   - Example:
+     # echo 0x7 > /sys/kernel/debug/pghot/enabled_sources
+     Enables all sources.
+
+2. **target_nid**
+   - Toptier NUMA node ID to which hot pages should be promoted when source
+     does not provide nid. Used when hotness source can't provide accessing
+     NID or when the tracking mode is default.
+   - Default: 0
+   - Example:
+     # echo 1 > /sys/kernel/debug/pghot/target_nid
+
+3. **freq_threshold**
+   - Minimum access frequency before a page is marked ready for promotion.
+   - Range: 1 to 3
+   - Default: 2
+   - Example:
+     # echo 3 > /sys/kernel/debug/pghot/freq_threshold
+
+4. **kmigrated_sleep_ms**
+   - Sleep interval (ms) for kmigrated thread between scans.
+   - Default: 100
+
+5. **kmigrated_batch_nr**
+   - Maximum number of folios migrated in one batch.
+   - Default: 512
+
+Sysctl Interface
+================
+1. pghot_promote_freq_window_ms
+
+Path: /proc/sys/vm/pghot_promote_freq_window_ms
+
+- Controls the time window (in ms) for counting access frequency. A page is
+  considered hot only when **freq_threshold** number of accesses occur within
+  this time period.
+- Default: 4000 (4 seconds)
+- Example:
+  # sysctl vm.pghot_promote_freq_window_ms=3000
+
+Vmstat Counters
+===============
+The following vmstat counters provide statistics about the pghot subsystem.
+
+Path: /proc/vmstat
+
+1. **pghot_recorded_accesses**
+   - Number of total hot page accesses recorded by pghot.
+
+2. **pghot_recorded_hwhints**
+   - Number of recorded accesses reported by hwhints source.
+
+3. **pghot_recorded_pgtscans**
+   - Number of recorded accesses reported by PTE A-bit based source.
+
+4. **pghot_recorded_hintfaults**
+   - Number of recorded accesses reported by NUMA Balancing based
+     hotness source.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307..22e08befb096 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1064,6 +1064,7 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_KMIGRATED_ACTIVATE,	/* activates kmigrated */
 };
 
 enum zone_flags {
@@ -1518,6 +1519,10 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+#ifdef CONFIG_PGHOT
+	struct task_struct *kmigrated;
+	wait_queue_head_t kmigrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -1916,12 +1921,28 @@ struct mem_section {
 	unsigned long section_mem_map;
 
 	struct mem_section_usage *usage;
+#ifdef CONFIG_PGHOT
+	/*
+	 * Per-PFN hotness data for this section.
+	 * Array of phi_t (u8 in default mode).
+	 * LSB is used as PGHOT_SECTION_HOT_BIT flag.
+	 */
+	void *hot_map;
+#endif
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
 	 * section. (see page_ext.h about this.)
 	 */
 	struct page_ext *page_ext;
+#endif
+	/*
+	 * Padding to maintain consistent mem_section size when exactly
+	 * one of PGHOT or PAGE_EXTENSION is enabled. This ensures
+	 * optimal alignment regardless of configuration.
+	 */
+#if (defined(CONFIG_PGHOT) && !defined(CONFIG_PAGE_EXTENSION)) || \
+	(!defined(CONFIG_PGHOT) && defined(CONFIG_PAGE_EXTENSION))
 	unsigned long pad;
 #endif
 	/*
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..88e57aab697b
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGHOT_H
+#define _LINUX_PGHOT_H
+
+/* Page hotness temperature sources */
+enum pghot_src {
+	PGHOT_HW_HINTS,
+	PGHOT_PGTABLE_SCAN,
+	PGHOT_HINT_FAULT,
+};
+
+#ifdef CONFIG_PGHOT
+#include
+
+extern unsigned int pghot_target_nid;
+extern unsigned int pghot_src_enabled;
+extern unsigned int pghot_freq_threshold;
+extern unsigned int kmigrated_sleep_ms;
+extern unsigned int kmigrated_batch_nr;
+extern unsigned int sysctl_pghot_freq_window;
+
+void pghot_debug_init(void);
+
+DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
+DECLARE_STATIC_KEY_FALSE(pghot_src_pgtscans);
+DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+
+/*
+ * Bit positions to enable individual sources in pghot/enabled_sources
+ * of debugfs.
+ */
+enum pghot_src_enabled {
+	PGHOT_HWHINTS_BIT = 0,
+	PGHOT_PGTSCAN_BIT,
+	PGHOT_HINTFAULT_BIT,
+	PGHOT_MAX_BIT
+};
+
+#define PGHOT_HWHINTS_ENABLED BIT(PGHOT_HWHINTS_BIT)
+#define PGHOT_PGTSCAN_ENABLED BIT(PGHOT_PGTSCAN_BIT)
+#define PGHOT_HINTFAULT_ENABLED BIT(PGHOT_HINTFAULT_BIT)
+#define PGHOT_SRC_ENABLED_MASK GENMASK(PGHOT_MAX_BIT - 1, 0)
+
+#define PGHOT_DEFAULT_FREQ_THRESHOLD 2
+
+#define KMIGRATED_DEFAULT_SLEEP_MS 100
+#define KMIGRATED_DEFAULT_BATCH_NR 512
+
+#define PGHOT_DEFAULT_NODE 0
+
+#define PGHOT_DEFAULT_FREQ_WINDOW (4 * MSEC_PER_SEC)
+
+/*
+ * Bits 0-6 are used to store frequency and time.
+ * Bit 7 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY 7
+
+#define PGHOT_FREQ_WIDTH 2
+/* Bucketed time is stored in 5 bits which can represent up to 4s with HZ=1000 */
+#define PGHOT_TIME_BUCKETS_WIDTH 7
+#define PGHOT_TIME_WIDTH 5
+#define PGHOT_NID_WIDTH 10
+
+#define PGHOT_FREQ_SHIFT 0
+#define PGHOT_TIME_SHIFT (PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_FREQ_MASK GENMASK(PGHOT_FREQ_WIDTH - 1, 0)
+#define PGHOT_TIME_MASK GENMASK(PGHOT_TIME_WIDTH - 1, 0)
+#define PGHOT_TIME_BUCKETS_MASK (PGHOT_TIME_MASK << PGHOT_TIME_BUCKETS_WIDTH)
+
+#define PGHOT_NID_MAX ((1 << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MAX ((1 << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1)
+
+typedef u8 phi_t;
+
+#define PGHOT_RECORD_SIZE sizeof(phi_t)
+
+#define PGHOT_SECTION_HOT_BIT 0
+#define PGHOT_SECTION_HOT_MASK BIT(PGHOT_SECTION_HOT_BIT)
+
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time);
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now);
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time);
+
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+	return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_PGHOT_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 92f80b4d69a6..5b8fd93b55fd 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -188,6 +188,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSTACK_REST,
 #endif
 #endif /* CONFIG_DEBUG_STACK_USAGE */
+#ifdef CONFIG_PGHOT
+		PGHOT_RECORDED_ACCESSES,
+		PGHOT_RECORD_HWHINTS,
+		PGHOT_RECORD_PGTSCANS,
+		PGHOT_RECORD_HINTFAULTS,
+#endif /* CONFIG_PGHOT */
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index bd0ea5454af8..f4f0147faac5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1464,6 +1464,20 @@ config PT_RECLAIM
 config FIND_NORMAL_PAGE
 	def_bool n
 
+config PGHOT
+	bool "Hot page tracking and promotion"
+	def_bool n
+	depends on NUMA && MIGRATION && SPARSEMEM && MMU
+	help
+	  A sub-system to track page accesses in lower tier memory and
+	  maintain hot page information. Promotes hot pages from lower
+	  tiers to top tier by using the memory access information provided
+	  by various sources. Asynchronous promotion is done by per-node
+	  kernel threads.
+
+	  This adds 1 byte of metadata overhead per page in lower-tier
+	  memory nodes.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 2d0570a16e5b..655a27f3a215 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
+obj-$(CONFIG_PGHOT) += pghot.o pghot-tunables.o pghot-default.o
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fc2a6f1e518f..64109feaa1c3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1401,6 +1401,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 #endif
 
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kmigrated(struct pglist_data *pgdat)
+{
+	init_waitqueue_head(&pgdat->kmigrated_wait);
+}
+#else
+static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {}
+#endif
+
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
 	int i;
@@ -1410,6 +1419,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
+	pgdat_init_kmigrated(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/pghot-default.c b/mm/pghot-default.c
new file mode 100644
index 000000000000..e0a3b2ed2592
--- /dev/null
+++ b/mm/pghot-default.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot: Default mode
+ *
+ * 1 byte hotness record per PFN.
+ * Bucketed time and frequency tracked as part of the record.
+ * Promotion to @pghot_target_nid by default.
+ */
+
+#include
+#include
+
+/*
+ * @time is regular time, @old_time is bucketed time.
+ */
+unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
+{
+	time &= PGHOT_TIME_BUCKETS_MASK;
+	old_time <<= PGHOT_TIME_BUCKETS_WIDTH;
+
+	return jiffies_to_msecs((time - old_time) & PGHOT_TIME_BUCKETS_MASK);
+}
+
+bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
+{
+	phi_t freq, old_freq, hotness, old_hotness, old_time;
+	phi_t time = now >> PGHOT_TIME_BUCKETS_WIDTH;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		bool new_window = false;
+
+		hotness = old_hotness;
+		old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+		old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+		if (pghot_access_latency(old_time, now) > sysctl_pghot_freq_window)
+			new_window = true;
+
+		if (new_window)
+			freq = 1;
+		else if (old_freq < PGHOT_FREQ_MAX)
+			freq = old_freq + 1;
+		else
+			freq = old_freq;
+
+		hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+		hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+		hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+		hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+		if (freq >= pghot_freq_threshold)
+			hotness |= BIT(PGHOT_MIGRATE_READY);
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+	return !!(hotness & BIT(PGHOT_MIGRATE_READY));
+}
+
+int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time)
+{
+	phi_t old_hotness, hotness = 0;
+
+	old_hotness = READ_ONCE(*phi);
+	do {
+		if (!(old_hotness & BIT(PGHOT_MIGRATE_READY)))
+			return -EINVAL;
+	} while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+	*nid = pghot_target_nid;
+	*freq = (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+	*time = (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+	return 0;
+}
diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c
new file mode 100644
index 000000000000..79afbcb1e4f0
--- /dev/null
+++ b/mm/pghot-tunables.c
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * pghot tunables in debugfs
+ */
+#include
+#include
+#include
+
+static struct dentry *debugfs_pghot;
+static DEFINE_MUTEX(pghot_tunables_lock);
+
+static ssize_t pghot_freq_th_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int freq;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 10, &freq))
+		return -EINVAL;
+
+	if (!freq || freq > PGHOT_FREQ_MAX)
+		return -EINVAL;
+
+	mutex_lock(&pghot_tunables_lock);
+	pghot_freq_threshold = freq;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_freq_th_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", pghot_freq_threshold);
+	return 0;
+}
+
+static int pghot_freq_th_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_freq_th_show, NULL);
+}
+
+static const struct file_operations pghot_freq_th_fops = {
+	.open = pghot_freq_th_open,
+	.write = pghot_freq_th_write,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static ssize_t pghot_target_nid_write(struct file *filp, const char __user *ubuf,
+				      size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int nid;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 10, &nid))
+		return -EINVAL;
+
+	if (nid > PGHOT_NID_MAX || !node_online(nid) || !node_is_toptier(nid))
+		return -EINVAL;
+	mutex_lock(&pghot_tunables_lock);
+	pghot_target_nid = nid;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_target_nid_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", pghot_target_nid);
+	return 0;
+}
+
+static int pghot_target_nid_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_target_nid_show, NULL);
+}
+
+static const struct file_operations pghot_target_nid_fops = {
+	.open = pghot_target_nid_open,
+	.write = pghot_target_nid_write,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static void pghot_src_enabled_update(unsigned int enabled)
+{
+	unsigned int changed = pghot_src_enabled ^ enabled;
+
+	if (changed & PGHOT_HWHINTS_ENABLED) {
+		if (enabled & PGHOT_HWHINTS_ENABLED)
+			static_branch_enable(&pghot_src_hwhints);
+		else
+			static_branch_disable(&pghot_src_hwhints);
+	}
+
+	if (changed & PGHOT_PGTSCAN_ENABLED) {
+		if (enabled & PGHOT_PGTSCAN_ENABLED)
+			static_branch_enable(&pghot_src_pgtscans);
+		else
+			static_branch_disable(&pghot_src_pgtscans);
+	}
+
+	if (changed & PGHOT_HINTFAULT_ENABLED) {
+		if (enabled & PGHOT_HINTFAULT_ENABLED)
+			static_branch_enable(&pghot_src_hintfaults);
+		else
+			static_branch_disable(&pghot_src_hintfaults);
+	}
+}
+
+static ssize_t pghot_src_enabled_write(struct file *filp, const char __user *ubuf,
+				       size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int enabled;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 0, &enabled))
+		return -EINVAL;
+
+	if (enabled & ~PGHOT_SRC_ENABLED_MASK)
+		return -EINVAL;
+
+	mutex_lock(&pghot_tunables_lock);
+	pghot_src_enabled_update(enabled);
+	pghot_src_enabled = enabled;
+	mutex_unlock(&pghot_tunables_lock);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int pghot_src_enabled_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", pghot_src_enabled);
+	return 0;
+}
+
+static int pghot_src_enabled_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, pghot_src_enabled_show, NULL);
+}
+
+static const struct file_operations pghot_src_enabled_fops = {
+	.open = pghot_src_enabled_open,
+	.write = pghot_src_enabled_write,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+void pghot_debug_init(void)
+{
+	debugfs_pghot = debugfs_create_dir("pghot", NULL);
+	debugfs_create_file("enabled_sources", 0644, debugfs_pghot, NULL,
+			    &pghot_src_enabled_fops);
+	debugfs_create_file("target_nid", 0644, debugfs_pghot, NULL,
+			    &pghot_target_nid_fops);
+	debugfs_create_file("freq_threshold", 0644, debugfs_pghot, NULL,
+			    &pghot_freq_th_fops);
+	debugfs_create_u32("kmigrated_sleep_ms", 0644, debugfs_pghot,
+			   &kmigrated_sleep_ms);
+	debugfs_create_u32("kmigrated_batch_nr", 0644, debugfs_pghot,
+			   &kmigrated_batch_nr);
+}
diff --git a/mm/pghot.c b/mm/pghot.c
new file mode 100644
index 000000000000..95b5012d5b99
--- /dev/null
+++ b/mm/pghot.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Maintains information about hot pages from slower tier nodes and
+ * promotes them.
+ *
+ * Per-PFN hotness information is stored for lower tier nodes in
+ * mem_section.
+ *
+ * In the default mode, a single byte (u8) is used to store
+ * the frequency of access and last access time. Promotions are done
+ * to a default toptier NID.
+ *
+ * A kernel thread named kmigrated is provided to migrate or promote
+ * the hot pages. kmigrated runs for each lower tier node. It iterates
+ * over the node's PFNs and migrates pages marked for migration into
+ * their targeted nodes.
+ */
+#include
+#include
+#include
+#include
+
+unsigned int pghot_target_nid = PGHOT_DEFAULT_NODE;
+unsigned int pghot_src_enabled;
+unsigned int pghot_freq_threshold = PGHOT_DEFAULT_FREQ_THRESHOLD;
+unsigned int kmigrated_sleep_ms = KMIGRATED_DEFAULT_SLEEP_MS;
+unsigned int kmigrated_batch_nr = KMIGRATED_DEFAULT_BATCH_NR;
+
+unsigned int sysctl_pghot_freq_window = PGHOT_DEFAULT_FREQ_WINDOW;
+
+DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
+DEFINE_STATIC_KEY_FALSE(pghot_src_pgtscans);
+DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+
+#ifdef CONFIG_SYSCTL
+static const struct ctl_table pghot_sysctls[] = {
+	{
+		.procname = "pghot_promote_freq_window_ms",
+		.data = &sysctl_pghot_freq_window,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+	},
+};
+#endif
+
+static bool kmigrated_started __ro_after_init;
+
+/**
+ * pghot_record_access() - Record page accesses from lower tier memory
+ * for the purpose of tracking page hotness and subsequent promotion.
+ *
+ * @pfn: PFN of the page
+ * @nid: Unused
+ * @src: The identifier of the sub-system that reports the access
+ * @now: Access time in jiffies
+ *
+ * Updates the frequency and time of access and marks the page as
+ * ready for migration if the frequency crosses a threshold. The pages
+ * marked for migration are migrated by kmigrated kernel thread.
+ *
+ * Return: 0 on success and -EINVAL on failure to record the access.
+ */
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+	struct mem_section *ms;
+	struct folio *folio;
+	phi_t *phi, *hot_map;
+	struct page *page;
+
+	if (!kmigrated_started)
+		return -EINVAL;
+
+	if (nid >= PGHOT_NID_MAX)
+		return -EINVAL;
+
+	switch (src) {
+	case PGHOT_HW_HINTS:
+		if (!static_branch_likely(&pghot_src_hwhints))
+			return -EINVAL;
+		count_vm_event(PGHOT_RECORD_HWHINTS);
+		break;
+	case PGHOT_PGTABLE_SCAN:
+		if (!static_branch_likely(&pghot_src_pgtscans))
+			return -EINVAL;
+		count_vm_event(PGHOT_RECORD_PGTSCANS);
+		break;
+	case PGHOT_HINT_FAULT:
+		if (!static_branch_likely(&pghot_src_hintfaults))
+			return -EINVAL;
+		count_vm_event(PGHOT_RECORD_HINTFAULTS);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/*
+	 * Record only accesses from lower tiers.
+	 */
+	if (node_is_toptier(pfn_to_nid(pfn)))
+		return 0;
+
+	/*
+	 * Reject the non-migratable pages right away.
+	 */
+	page = pfn_to_online_page(pfn);
+	if (!page || is_zone_device_page(page))
+		return 0;
+
+	folio = page_folio(page);
+	if (!folio_test_lru(folio))
+		return 0;
+
+	/* Get the hotness slot corresponding to the 1st PFN of the folio */
+	pfn = folio_pfn(folio);
+	ms = __pfn_to_section(pfn);
+	if (!ms || !ms->hot_map)
+		return -EINVAL;
+
+	hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
+	phi = &hot_map[pfn % PAGES_PER_SECTION];
+
+	count_vm_event(PGHOT_RECORDED_ACCESSES);
+
+	/*
+	 * Update the hotness parameters.
+	 */
+	if (pghot_update_record(phi, nid, now)) {
+		set_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map);
+		set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
+	}
+	return 0;
+}
+
+static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq,
+			     unsigned long *time)
+{
+	phi_t *phi, *hot_map;
+	struct mem_section *ms;
+
+	ms = __pfn_to_section(pfn);
+	if (!ms || !ms->hot_map)
+		return -EINVAL;
+
+	hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) & ~PGHOT_SECTION_HOT_MASK);
+	phi = &hot_map[pfn % PAGES_PER_SECTION];
+
+	return pghot_get_record(phi, nid, freq, time);
+}
+
+/*
+ * Walks the PFNs of the zone, isolates and migrates them in batches.
+ */
+static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
+				int src_nid)
+{
+	int cur_nid = NUMA_NO_NODE;
+	LIST_HEAD(migrate_list);
+	int batch_count = 0;
+	struct folio *folio;
+	struct page *page;
+	unsigned long pfn;
+
+	pfn = start_pfn;
+	do {
+		int nid = NUMA_NO_NODE, nr = 1;
+		int freq = 0;
+		unsigned long time = 0;
+
+		if (!pfn_valid(pfn))
+			goto out_next;
+
+		page = pfn_to_online_page(pfn);
+		if (!page)
+			goto out_next;
+
+		folio = page_folio(page);
+		nr = folio_nr_pages(folio);
+		if (folio_nid(folio) != src_nid)
+			goto out_next;
+
+		if (!folio_test_lru(folio))
+			goto out_next;
+
+		if (pghot_get_hotness(pfn, &nid, &freq, &time))
+			goto out_next;
+
+		if (nid == NUMA_NO_NODE)
+			nid = pghot_target_nid;
+
+		if (folio_nid(folio) == nid)
+			goto out_next;
+
+		if (migrate_misplaced_folio_prepare(folio, NULL, nid))
+			goto out_next;
+
+		if (cur_nid == NUMA_NO_NODE)
+			cur_nid = nid;
+
+		/* If NID changed, flush the previous batch first */
+		if (cur_nid != nid) {
+			if (!list_empty(&migrate_list))
+				migrate_misplaced_folios_batch(&migrate_list, cur_nid);
+			cur_nid = nid;
+			batch_count = 0;
+			cond_resched();
+		}
+
+		list_add(&folio->lru, &migrate_list);
+
+		if (++batch_count > kmigrated_batch_nr) {
+			migrate_misplaced_folios_batch(&migrate_list, cur_nid);
+			batch_count = 0;
+			cond_resched();
+		}
+out_next:
+		pfn += nr;
+	} while (pfn < end_pfn);
+	if (!list_empty(&migrate_list))
+		migrate_misplaced_folios_batch(&migrate_list, cur_nid);
+}
+
+static void kmigrated_do_work(pg_data_t *pgdat)
+{
+	unsigned long section_nr, s_begin, start_pfn;
+	struct mem_section *ms;
+	int nid;
+
+	clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+	/* s_begin = first_present_section_nr(); */
+	s_begin = next_present_section_nr(-1);
+	for_each_present_section_nr(s_begin, section_nr) {
+		start_pfn = section_nr_to_pfn(section_nr);
+		ms = __nr_to_section(section_nr);
+
+		if (!pfn_valid(start_pfn))
+			continue;
+
+		nid = pfn_to_nid(start_pfn);
+		if (node_is_toptier(nid) || nid != pgdat->node_id)
+			continue;
+
+		if (!test_and_clear_bit(PGHOT_SECTION_HOT_BIT, (unsigned long *)&ms->hot_map))
+			continue;
+
+		kmigrated_walk_zone(start_pfn, start_pfn + PAGES_PER_SECTION,
+				    pgdat->node_id);
+	}
+}
+
+static inline bool kmigrated_work_requested(pg_data_t *pgdat)
+{
+	return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+}
+
+/*
+ * Per-node kthread that iterates over its PFNs and migrates the
+ * pages that have been marked for migration.
+ */
+static int kmigrated(void *p)
+{
+	long timeout = msecs_to_jiffies(kmigrated_sleep_ms);
+	pg_data_t *pgdat = p;
+
+	while (!kthread_should_stop()) {
+		if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(pgdat),
+				       timeout))
+			kmigrated_do_work(pgdat);
+	}
+	return 0;
+}
+
+static int kmigrated_run(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int ret;
+
+	if (node_is_toptier(nid))
+		return 0;
+
+	if (!pgdat->kmigrated) {
+		pgdat->kmigrated = kthread_create_on_node(kmigrated, pgdat, nid,
+							  "kmigrated%d", nid);
+		if (IS_ERR(pgdat->kmigrated)) {
+			ret = PTR_ERR(pgdat->kmigrated);
+			pgdat->kmigrated = NULL;
+			pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret);
+			return ret;
+		}
+		pr_info("pghot: Started kmigrated thread for node %d\n", nid);
+	}
+	wake_up_process(pgdat->kmigrated);
+	return 0;
+}
+
+static void pghot_free_hot_map(void)
+{
+	unsigned long section_nr, s_begin;
+	struct mem_section *ms;
+
+	/* s_begin = first_present_section_nr(); */
+	s_begin = next_present_section_nr(-1);
+	for_each_present_section_nr(s_begin, section_nr) {
+		ms = __nr_to_section(section_nr);
+		kfree(ms->hot_map);
+	}
+}
+
+static int pghot_alloc_hot_map(void)
+{
+	unsigned long section_nr, s_begin, start_pfn;
+	struct mem_section *ms;
+	int nid;
+
+	/* s_begin = first_present_section_nr(); */
+	s_begin = next_present_section_nr(-1);
+	for_each_present_section_nr(s_begin, section_nr) {
+		ms = __nr_to_section(section_nr);
+		start_pfn = section_nr_to_pfn(section_nr);
+		nid = pfn_to_nid(start_pfn);
+
+		if (node_is_toptier(nid) || !pfn_valid(start_pfn))
+			continue;
+
+		ms->hot_map = kcalloc_node(PAGES_PER_SECTION, PGHOT_RECORD_SIZE, GFP_KERNEL,
+					   nid);
+		if (!ms->hot_map)
+			goto out_free_hot_map;
+	}
+	return 0;
+
+out_free_hot_map:
+	pghot_free_hot_map();
+	return -ENOMEM;
+}
+
+static int __init pghot_init(void)
+{
+	pg_data_t *pgdat;
+	int nid, ret;
+
+	ret = pghot_alloc_hot_map();
+	if (ret)
+		return ret;
+
+	for_each_node_state(nid, N_MEMORY) {
+		ret = kmigrated_run(nid);
+		if (ret)
+			goto out_stop_kthread;
+	}
+	register_sysctl_init("vm", pghot_sysctls);
+	pghot_debug_init();
+
+	kmigrated_started = true;
+	return 0;
+
+out_stop_kthread:
+	for_each_node_state(nid, N_MEMORY) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kmigrated) {
+			kthread_stop(pgdat->kmigrated);
+			pgdat->kmigrated = NULL;
+		}
+	}
+	pghot_free_hot_map();
+	return ret;
+}
+
+late_initcall_sync(pghot_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 65de88cdf40e..f6f91b9dd887 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1501,6 +1501,12 @@ const char * const vmstat_text[] = {
 	[I(KSTACK_REST)] = "kstack_rest",
 #endif
 #endif
+#ifdef CONFIG_PGHOT
+	[I(PGHOT_RECORDED_ACCESSES)] = "pghot_recorded_accesses",
+	[I(PGHOT_RECORD_HWHINTS)] = "pghot_recorded_hwhints",
+	[I(PGHOT_RECORD_PGTSCANS)] = "pghot_recorded_pgtscans",
+	[I(PGHOT_RECORD_HINTFAULTS)] = "pghot_recorded_hintfaults",
+#endif /* CONFIG_PGHOT */
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 };
-- 
2.34.1

From nobody Sat Feb 7 08:44:11 2026
From: Bharata B Rao
Subject: [RFC PATCH v5 04/10] mm: pghot: Precision mode for pghot
Date: Thu, 29 Jan 2026 20:10:37 +0530
Message-ID: <20260129144043.231636-5-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260129144043.231636-1-bharata@amd.com>
References: <20260129144043.231636-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-MS-Exchange-CrossTenant-AuthSource: MWH0EPF000971E5.namprd02.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: MN2PR12MB4192 Content-Type: text/plain; charset="utf-8" By default, one byte per PFN is used to store hotness information. Only a limited number of bits is available for the access time, which leads to coarse-grained time tracking. There also aren't enough bits to track the toptier NID explicitly, so the default target_nid is used for promotion. Precision mode relaxes these constraints by storing the hotness information in 4 bytes per PFN. This makes finer-grained access time tracking and toptier NID tracking possible. It is typically useful when the toptier consists of more than one node. Signed-off-by: Bharata B Rao --- Documentation/admin-guide/mm/pghot.txt | 4 +- include/linux/mmzone.h | 2 +- include/linux/pghot.h | 31 ++++++++++++ mm/Kconfig | 11 ++++ mm/Makefile | 7 ++- mm/pghot-precise.c | 70 ++++++++++++++++++++++++++ mm/pghot.c | 13 +++-- 7 files changed, 130 insertions(+), 8 deletions(-) create mode 100644 mm/pghot-precise.c diff --git a/Documentation/admin-guide/mm/pghot.txt b/Documentation/admin-g= uide/mm/pghot.txt index 01291b72e7ab..b329e692ef89 100644 --- a/Documentation/admin-guide/mm/pghot.txt +++ b/Documentation/admin-guide/mm/pghot.txt @@ -38,7 +38,7 @@ Path: /sys/kernel/debug/pghot/ =20 3. **freq_threshold** - Minimum access frequency before a page is marked ready for promotion. - - Range: 1 to 3 + - Range: 1 to 3 in default mode, 1 to 7 in precision mode. - Default: 2 - Example: # echo 3 > /sys/kernel/debug/pghot/freq_threshold @@ -60,7 +60,7 @@ Path: /proc/sys/vm/pghot_promote_freq_window_ms - Controls the time window (in ms) for counting access frequency. A page is considered hot only when **freq_threshold** number of accesses occur with this time period.
-- Default: 4000 (4 seconds) +- Default: 4000 (4 seconds) in default mode and 5000 (5s) in precision mod= e. - Example: # sysctl vm.pghot_promote_freq_window_ms=3D3000 =20 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 22e08befb096..49c374064fc2 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1924,7 +1924,7 @@ struct mem_section { #ifdef CONFIG_PGHOT /* * Per-PFN hotness data for this section. - * Array of phi_t (u8 in default mode). + * Array of phi_t (u8 in default mode, u32 in precision mode). * LSB is used as PGHOT_SECTION_HOT_BIT flag. */ void *hot_map; diff --git a/include/linux/pghot.h b/include/linux/pghot.h index 88e57aab697b..d3d59b0c0cf6 100644 --- a/include/linux/pghot.h +++ b/include/linux/pghot.h @@ -48,6 +48,36 @@ enum pghot_src_enabled { =20 #define PGHOT_DEFAULT_NODE 0 =20 +#if defined(CONFIG_PGHOT_PRECISE) +#define PGHOT_DEFAULT_FREQ_WINDOW (5 * MSEC_PER_SEC) + +/* + * Bits 0-26 are used to store nid, frequency and time. + * Bits 27-30 are unused now. + * Bit 31 is used to indicate the page is ready for migration. 
+ */ +#define PGHOT_MIGRATE_READY 31 + +#define PGHOT_NID_WIDTH 10 +#define PGHOT_FREQ_WIDTH 3 +/* time is stored in 14 bits which can represent up to 16s with HZ=3D1000 = */ +#define PGHOT_TIME_WIDTH 14 + +#define PGHOT_NID_SHIFT 0 +#define PGHOT_FREQ_SHIFT (PGHOT_NID_SHIFT + PGHOT_NID_WIDTH) +#define PGHOT_TIME_SHIFT (PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH) + +#define PGHOT_NID_MASK GENMASK(PGHOT_NID_WIDTH - 1, 0) +#define PGHOT_FREQ_MASK GENMASK(PGHOT_FREQ_WIDTH - 1, 0) +#define PGHOT_TIME_MASK GENMASK(PGHOT_TIME_WIDTH - 1, 0) + +#define PGHOT_NID_MAX ((1 << PGHOT_NID_WIDTH) - 1) +#define PGHOT_FREQ_MAX ((1 << PGHOT_FREQ_WIDTH) - 1) +#define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1) + +typedef u32 phi_t; + +#else /* !CONFIG_PGHOT_PRECISE */ #define PGHOT_DEFAULT_FREQ_WINDOW (4 * MSEC_PER_SEC) =20 /* @@ -74,6 +104,7 @@ enum pghot_src_enabled { #define PGHOT_TIME_MAX ((1 << PGHOT_TIME_WIDTH) - 1) =20 typedef u8 phi_t; +#endif /* CONFIG_PGHOT_PRECISE */ =20 #define PGHOT_RECORD_SIZE sizeof(phi_t) =20 diff --git a/mm/Kconfig b/mm/Kconfig index f4f0147faac5..fde5aee3e16f 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1478,6 +1478,17 @@ config PGHOT This adds 1 byte of metadata overhead per page in lower-tier memory nodes. =20 +config PGHOT_PRECISE + bool "Hot page tracking precision mode" + def_bool n + depends on PGHOT + help + Enables precision mode for tracking hot pages with pghot sub-system. + Adds fine-grained access time tracking and explicit toptier target + NID tracking. Precise hot page tracking comes at the cost of using + 4 bytes per page against the default one byte per page. Preferable + to enable this on systems with multiple nodes in toptier. 
+ source "mm/damon/Kconfig" =20 endmenu diff --git a/mm/Makefile b/mm/Makefile index 655a27f3a215..89f999647752 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -147,4 +147,9 @@ obj-$(CONFIG_SHRINKER_DEBUG) +=3D shrinker_debug.o obj-$(CONFIG_EXECMEM) +=3D execmem.o obj-$(CONFIG_TMPFS_QUOTA) +=3D shmem_quota.o obj-$(CONFIG_PT_RECLAIM) +=3D pt_reclaim.o -obj-$(CONFIG_PGHOT) +=3D pghot.o pghot-tunables.o pghot-default.o +obj-$(CONFIG_PGHOT) +=3D pghot.o pghot-tunables.o +ifdef CONFIG_PGHOT_PRECISE +obj-$(CONFIG_PGHOT) +=3D pghot-precise.o +else +obj-$(CONFIG_PGHOT) +=3D pghot-default.o +endif diff --git a/mm/pghot-precise.c b/mm/pghot-precise.c new file mode 100644 index 000000000000..d8d4f15b3f9f --- /dev/null +++ b/mm/pghot-precise.c @@ -0,0 +1,70 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * pghot: Precision mode + * + * 4 byte hotness record per PFN (u32) + * NID, time and frequency tracked as part of the record. + */ + +#include +#include + +unsigned long pghot_access_latency(unsigned long old_time, unsigned long t= ime) +{ + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK); +} + +bool pghot_update_record(phi_t *phi, int nid, unsigned long now) +{ + phi_t freq, old_freq, hotness, old_hotness, old_time, old_nid; + phi_t time =3D now & PGHOT_TIME_MASK; + + old_hotness =3D READ_ONCE(*phi); + do { + bool new_window =3D false; + + hotness =3D old_hotness; + old_nid =3D (hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK; + old_freq =3D (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + old_time =3D (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + + if (pghot_access_latency(old_time, time) > sysctl_pghot_freq_window) + new_window =3D true; + + if (new_window) + freq =3D 1; + else if (old_freq < PGHOT_FREQ_MAX) + freq =3D old_freq + 1; + else + freq =3D old_freq; + nid =3D (nid =3D=3D NUMA_NO_NODE) ? 
pghot_target_nid : nid; + + hotness &=3D ~(PGHOT_NID_MASK << PGHOT_NID_SHIFT); + hotness &=3D ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT); + hotness &=3D ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT); + + hotness |=3D (nid & PGHOT_NID_MASK) << PGHOT_NID_SHIFT; + hotness |=3D (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT; + hotness |=3D (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT; + + if (freq >=3D pghot_freq_threshold) + hotness |=3D BIT(PGHOT_MIGRATE_READY); + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + return !!(hotness & BIT(PGHOT_MIGRATE_READY)); +} + +int pghot_get_record(phi_t *phi, int *nid, int *freq, unsigned long *time) +{ + phi_t old_hotness, hotness =3D 0; + + old_hotness =3D READ_ONCE(*phi); + do { + if (!(old_hotness & BIT(PGHOT_MIGRATE_READY))) + return -EINVAL; + } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness))); + + *nid =3D (old_hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK; + *freq =3D (old_hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK; + *time =3D (old_hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK; + return 0; +} diff --git a/mm/pghot.c b/mm/pghot.c index 95b5012d5b99..bf1d9029cbaa 100644 --- a/mm/pghot.c +++ b/mm/pghot.c @@ -10,6 +10,9 @@ * the frequency of access and last access time. Promotions are done * to a default toptier NID. * + * In the precision mode, 4 bytes are used to store the frequency + * of access, last access time and the accessing NID. + * * A kernel thread named kmigrated is provided to migrate or promote * the hot pages. kmigrated runs for each lower tier node. It iterates * over the node's PFNs and migrates pages marked for migration into @@ -52,13 +55,15 @@ static bool kmigrated_started __ro_after_init; * for the purpose of tracking page hotness and subsequent promotion. 
* * @pfn: PFN of the page - * @nid: Unused + * @nid: Target NID to where the page needs to be migrated in precision + * mode but unused in default mode * @src: The identifier of the sub-system that reports the access * @now: Access time in jiffies * - * Updates the frequency and time of access and marks the page as - * ready for migration if the frequency crosses a threshold. The pages - * marked for migration are migrated by kmigrated kernel thread. + * Updates the NID (in precision mode only), frequency and time of access + * and marks the page as ready for migration if the frequency crosses a + * threshold. The pages marked for migration are migrated by kmigrated + * kernel thread. * * Return: 0 on success and -EINVAL on failure to record the access. */ --=20 2.34.1 From nobody Sat Feb 7 08:44:11 2026 Received: from PH0PR06CU001.outbound.protection.outlook.com (mail-westus3azon11011057.outbound.protection.outlook.com [40.107.208.57]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 849092D7384 for ; Thu, 29 Jan 2026 14:43:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=40.107.208.57 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769697838; cv=fail; b=ezgEraGbfO70YuB4lX/FF/30VeLrd6OGM985dYgYb0zzQtsPHXyMu6upctduKl1sis7+1vZ4McyS/Q/iQa5evnArfPBtI6YsQyuLc5j/U8PbnjTDUc8+v9yXap3mO4KfFxnCB/G0/NI67eaDOI7jtJLiu0BBVprTx9u7p0VP/J0= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1769697838; c=relaxed/simple; bh=LjwL7bvnAMGF+c9XkJRSaJrJSUxLLiL30ugEbQQP/38=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=IX+dUpp2p2iH4tZv8P10pWkh+soh0NMENewFTbK5m9TZAJowLrxbAgW5W8YDqUD5CfgHRFFLaXJtM50uZb9nnRieehV+bwqNEIuECJYnQi4s6g64T6AeTw3Rat0mn5pcEhSTo2hG9VWOuRHBIlgnSTE2G+xul+qnBWuUZAi0XOE= ARC-Authentication-Results: i=2; 
smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com; spf=fail smtp.mailfrom=amd.com; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b=FYBGgP99; arc=fail smtp.client-ip=40.107.208.57 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amd.com Authentication-Results: smtp.subspace.kernel.org; spf=fail smtp.mailfrom=amd.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="FYBGgP99" ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=Ov6GxMRxGcsCY8SZ4we00eW6IvMPXsR4+0u47M6fTzT2UZU/06xJHy7mtXISin3byoNv5dsWwYAg1pmsuXua0VyXaXtP9BbTfUDVXzVOqm62pIAXMN5E/2UFQ6cdYvJNDiCUqAG66lZlTVTSJg8FmiOxPFJGI+R9JXuQqW5P3eF1+kdKdD7d96AXgzcdgfVlxhD+okj9h2LNHTnawmNmISvI1NkcETjc4LqzZ5LRGQcL/A1HaJ61AxNqY2FV3OG7WWdtqrusbeCzAbzkEa2M1Yx8CeeLPvhjeRGz7o/92NDAm3os5+982fuxfJdKHfbl3TwSk51Ze6Xk6Zj+ssrENA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=geW2Hchw2hipgNtUXmkYBzbL3lmxlreN3f1iPu8nAVM=; b=lQG/nHplw+Ky2aOdNxTkO2/4n6jISLU/1uQ8DFJmuI7lt7qy8NsHBvB+NQ5XpR/KIFJCtfsT20uqxob9/em9xKQLuywPiOFTCsx/e02yAUrvRibvJttzIguUoJ6e8f48oW3c8/M4qmkUhpObHMfgjQFTSOCC11Lj5w4u8UxFwOtyYiFe0WUX57qR+RbdHGavHK6jga+yNsIfF9UempQZNX4h3OxAamxhYq0+NLY72mQ2oPwD7KPCbtLcjNr8ztlbtkVP3qxv8yFnhpOX1Liwfws8seBInzepWSuJSQPNUK3IDFObLXAkEKkhQ1D8+wiVF1El2BL8/PCr3d3pYfl+/Q== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; 
h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=geW2Hchw2hipgNtUXmkYBzbL3lmxlreN3f1iPu8nAVM=; b=FYBGgP99z14DgkUKkXE1tjotMXUmt0p/v2pImm2tk7t5dfn2Isj0yIen0oviQoOyYPXLh992OPbFA7mXYStUDgx1Tm1IajtDa5SzZQQnQwiJt7etVhUCCPXBe45vY36LU24FIsMdYpDcB4smbg+7+QX83+yzyDahmkxQkd0hio4= Received: from BL1PR13CA0426.namprd13.prod.outlook.com (2603:10b6:208:2c3::11) by MW4PR12MB7215.namprd12.prod.outlook.com (2603:10b6:303:228::8) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9564.7; Thu, 29 Jan 2026 14:43:45 +0000 Received: from BN2PEPF000055DA.namprd21.prod.outlook.com (2603:10b6:208:2c3:cafe::43) by BL1PR13CA0426.outlook.office365.com (2603:10b6:208:2c3::11) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9564.7 via Frontend Transport; Thu, 29 Jan 2026 14:43:39 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C Received: from satlexmb07.amd.com (165.204.84.17) by BN2PEPF000055DA.mail.protection.outlook.com (10.167.245.4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9587.0 via Frontend Transport; Thu, 29 Jan 2026 14:43:39 +0000 Received: from BLR-L-BHARARAO.amd.com (10.180.168.240) by satlexmb07.amd.com (10.181.42.216) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Thu, 29 Jan 2026 08:43:29 -0600 From: Bharata B Rao To: , CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , Bharata B Rao Subject: [RFC PATCH v5 05/10] mm: sched: move NUMA balancing tiering promotion to pghot Date: Thu, 29 Jan 2026 20:10:38 
+0530 Message-ID: <20260129144043.231636-6-bharata@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260129144043.231636-1-bharata@amd.com> References: <20260129144043.231636-1-bharata@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: satlexmb08.amd.com (10.181.42.217) To satlexmb07.amd.com (10.181.42.216) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: BN2PEPF000055DA:EE_|MW4PR12MB7215:EE_ X-MS-Office365-Filtering-Correlation-Id: 16c747cb-30e5-4748-1927-08de5f44caa6 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|1800799024|7416014|36860700013|82310400026|376014; X-Microsoft-Antispam-Message-Info: =?us-ascii?Q?ihCqLzDMbu97E9XSPwxAArG1KN5ZEtbqekEfDwvHAKmxxENB0flaC0KN9cAM?= =?us-ascii?Q?a2TpPa56wu/rVwbo8IV6HN8ZiremCDq7YWIb0StYURq8hQ7xIjytyEEKDOiP?= =?us-ascii?Q?aLeRvjHDSmH/CcyZ8fH7n3GA7xjEfmLol50CLtS9fqAdhe/xMp+gLFqQsBs/?= =?us-ascii?Q?RpqbJK1B+R1EXxcPJO/t2z57XQzK9Pl6u2+hutzopAXf/b8X8ag1Jhe1jKtQ?= =?us-ascii?Q?c+AfxAKaWSlhQxTXEv9ais+0TcwPJ1kK1LMqWAhGEQnd/Xgu5msKkegfXIU7?= =?us-ascii?Q?DTrz0sQJUw8RcInfBX6XJnGCRRlOYcQ/6U8lYDf3ErB8V9zdV/QyAazAOUfz?= =?us-ascii?Q?miEq4ogev4FIcZot33I0XW/rfU8YShHTldSdoQaBITsO+LMRTZMiL1O+loSw?= =?us-ascii?Q?WtorntWtDXq8qElYbUcBz4dMYPO0eL+vRVHs4OmlWeHsKVM/FGvqQBSC+O/e?= =?us-ascii?Q?S7pvz9M+gYV8AsNHuVHwf1lUpemtRyw002oBf3Fp2zCGIybFHYMpWdT8KRyl?= =?us-ascii?Q?+z9GDLL7dka37tOwH0K8gXi1g6ppIPmAaf/I3VNEIFckGLiArkjg7V8JVFHu?= =?us-ascii?Q?vH8HhX+BwOFSQMFdRjJoLovYNU33mpRwURHuN/Dg6lzddMrDywNfrY+ltjXM?= =?us-ascii?Q?HiUyMZBUGCso0qzrQp9MWC1ynfnul28p9UmqlmrIm+kk2PgN6xzmxmtm5iT7?= =?us-ascii?Q?+YCAF3ZhCshNcjH6r+gOFE5giXbevKBNtMfDk0zBzk7wAb3MAv9CylNsHJPY?= =?us-ascii?Q?4Mu7u+BNHcxOkHybT4pqMHA9ZBRUVnpRahzllRgnkmvz1p7T3ysD7kk9YcCA?= =?us-ascii?Q?6y/FlIYYrQMDpqNouG+mawLI8keMRNjZX134SzfStoYIEQii19P/bVMOKUan?= 
=?us-ascii?Q?32lyVfJBM83lBGPrk18i3ji5iQ/uh/oQ1sxN9fq9a7yeZTFOMF8ckNjg42lq?= =?us-ascii?Q?tPsZIPGEgRGLcAnacnUVHNz9B40FRhgVhlldNWLTyCTbVUKQbJWVR2dmIeTP?= =?us-ascii?Q?RJ922FZD17KhixC/mJmvoCBH3ddQSX5cso6E46bTIENbM+xZY/P8BTPb3FB1?= =?us-ascii?Q?PtlWZhfzAKYjJlgIIH//BbK1eq1FQ6qbsJDOaU5PyKK7PaLGsJJAG49CWHEU?= =?us-ascii?Q?2iQh/EfxnqcR824aIsYe7PiOwzmJVTe5qfljWw18BkQBoRjAJWdLKwsnkQbH?= =?us-ascii?Q?4KxGHJRlEVqjUKQC2jJtbVlfz8sZqECXuDMEi2YTQGuPPIubwKOiVXnckhtt?= =?us-ascii?Q?cBs/pPT6RvnAttvw3CzO+escJJQPqrzYNMhfcOIZ2JU3YdREeeoe8VxmP6AM?= =?us-ascii?Q?h8gyqhbfMVQJ8tpU1a7JXL/ByRsro/u+Wffvo/EZsHRwz/dvZpb04WYYpucR?= =?us-ascii?Q?6W0W7V6uao5PdZ765fzL9ffkwwiS5lzjN4nTnFqnxpoMBLI37OrcKHk/yVcu?= =?us-ascii?Q?0NmW9cg6lgc5SQ50gSl9B6qfZX2y6TH9jEmvttlr2Ak/WKCBjfDggEWMgBIo?= =?us-ascii?Q?+Fg+5JYFvOm9hHI5YY2qEDHBHl9H/wbxxoL4gaxx1qgQ7hQJVbQbH28+0k1C?= =?us-ascii?Q?ghoidCV7pTRd8RqvkWM1kRlCaW2oldplVSRBJLEweYPxiS8gHa53AfcSGKVe?= =?us-ascii?Q?2KZ/+m6oekfQjfqF8CJGrr8H0JrNmyMzOcj5FrQdYtPOhPzrU6Uv6jbTzAuw?= =?us-ascii?Q?tJL41g=3D=3D?= X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(1800799024)(7416014)(36860700013)(82310400026)(376014);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 29 Jan 2026 14:43:39.5047 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 16c747cb-30e5-4748-1927-08de5f44caa6 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com] X-MS-Exchange-CrossTenant-AuthSource: BN2PEPF000055DA.namprd21.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: MW4PR12MB7215 Content-Type: text/plain; charset="utf-8" Currently hot page promotion 
(NUMA_BALANCING_MEMORY_TIERING mode of NUMA Balancing) does hot page detection (via hint faults), hot page classification and eventual promotion all by itself, and sits within the scheduler. Now that pghot, the new hot page tracking and promotion mechanism, is available, NUMA Balancing can limit itself to detecting hot pages (via hint faults) and offload the rest of the functionality to the common hot page tracking system. The pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the hot page info to pghot. In addition, the migration rate limiting and dynamic threshold logic are moved to kmigrated so that they can also be used for hot pages reported by other sources. Signed-off-by: Bharata B Rao --- kernel/sched/debug.c | 1 - kernel/sched/fair.c | 152 ++----------------------------------------- mm/huge_memory.c | 26 ++------ mm/memory.c | 31 ++------- mm/pghot.c | 124 +++++++++++++++++++++++++++++++++++ 5 files changed, 141 insertions(+), 193 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 41caa22e0680..02931902a9c6 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -520,7 +520,6 @@ static __init int sched_init_debug(void) debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balanci= ng_scan_period_min); debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balanci= ng_scan_period_max); debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_sca= n_size); - debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing= _hot_threshold); #endif /* CONFIG_NUMA_BALANCING */ =20 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index da46c3164537..4e70f58fbbfa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu) static unsigned int sysctl_sched_cfs_bandwidth_slice =3D 5000UL; #endif =20 -#ifdef CONFIG_NUMA_BALANCING -/* Restrict
the NUMA promotion throughput (MB/s) for each target node. */ -static unsigned int sysctl_numa_balancing_promote_rate_limit =3D 65536; -#endif - #ifdef CONFIG_SYSCTL static const struct ctl_table sched_fair_sysctls[] =3D { #ifdef CONFIG_CFS_BANDWIDTH @@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] =3D= { .extra1 =3D SYSCTL_ONE, }, #endif -#ifdef CONFIG_NUMA_BALANCING - { - .procname =3D "numa_balancing_promote_rate_limit_MBps", - .data =3D &sysctl_numa_balancing_promote_rate_limit, - .maxlen =3D sizeof(unsigned int), - .mode =3D 0644, - .proc_handler =3D proc_dointvec_minmax, - .extra1 =3D SYSCTL_ZERO, - }, -#endif /* CONFIG_NUMA_BALANCING */ }; =20 static int __init sched_fair_sysctl_init(void) @@ -1427,9 +1412,6 @@ unsigned int sysctl_numa_balancing_scan_size =3D 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in m= s */ unsigned int sysctl_numa_balancing_scan_delay =3D 1000; =20 -/* The page with hint page fault latency < threshold in ms is considered h= ot */ -unsigned int sysctl_numa_balancing_hot_threshold =3D MSEC_PER_SEC; - struct numa_group { refcount_t refcount; =20 @@ -1784,108 +1766,6 @@ static inline bool cpupid_valid(int cpupid) return cpupid_to_cpu(cpupid) < nr_cpu_ids; } =20 -/* - * For memory tiering mode, if there are enough free pages (more than - * enough watermark defined here) in fast memory node, to take full - * advantage of fast memory capacity, all recently accessed slow - * memory pages will be migrated to fast memory node without - * considering hot threshold. 
- */ -static bool pgdat_free_space_enough(struct pglist_data *pgdat) -{ - int z; - unsigned long enough_wmark; - - enough_wmark =3D max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, - pgdat->node_present_pages >> 4); - for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { - struct zone *zone =3D pgdat->node_zones + z; - - if (!populated_zone(zone)) - continue; - - if (zone_watermark_ok(zone, 0, - promo_wmark_pages(zone) + enough_wmark, - ZONE_MOVABLE, 0)) - return true; - } - return false; -} - -/* - * For memory tiering mode, when page tables are scanned, the scan - * time will be recorded in struct page in addition to make page - * PROT_NONE for slow memory page. So when the page is accessed, in - * hint page fault handler, the hint page fault latency is calculated - * via, - * - * hint page fault latency =3D hint page fault time - scan time - * - * The smaller the hint page fault latency, the higher the possibility - * for the page to be hot. - */ -static int numa_hint_fault_latency(struct folio *folio) -{ - int last_time, time; - - time =3D jiffies_to_msecs(jiffies); - last_time =3D folio_xchg_access_time(folio, time); - - return (time - last_time) & PAGE_ACCESS_TIME_MASK; -} - -/* - * For memory tiering mode, too high promotion/demotion throughput may - * hurt application latency. So we provide a mechanism to rate limit - * the number of pages that are tried to be promoted. 
- */ -static bool numa_promotion_rate_limit(struct pglist_data *pgdat, - unsigned long rate_limit, int nr) -{ - unsigned long nr_cand; - unsigned int now, start; - - now =3D jiffies_to_msecs(jiffies); - mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr); - nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); - start =3D pgdat->nbp_rl_start; - if (now - start > MSEC_PER_SEC && - cmpxchg(&pgdat->nbp_rl_start, start, now) =3D=3D start) - pgdat->nbp_rl_nr_cand =3D nr_cand; - if (nr_cand - pgdat->nbp_rl_nr_cand >=3D rate_limit) - return true; - return false; -} - -#define NUMA_MIGRATION_ADJUST_STEPS 16 - -static void numa_promotion_adjust_threshold(struct pglist_data *pgdat, - unsigned long rate_limit, - unsigned int ref_th) -{ - unsigned int now, start, th_period, unit_th, th; - unsigned long nr_cand, ref_cand, diff_cand; - - now =3D jiffies_to_msecs(jiffies); - th_period =3D sysctl_numa_balancing_scan_period_max; - start =3D pgdat->nbp_th_start; - if (now - start > th_period && - cmpxchg(&pgdat->nbp_th_start, start, now) =3D=3D start) { - ref_cand =3D rate_limit * - sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC; - nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); - diff_cand =3D nr_cand - pgdat->nbp_th_nr_cand; - unit_th =3D ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS; - th =3D pgdat->nbp_threshold ? : ref_th; - if (diff_cand > ref_cand * 11 / 10) - th =3D max(th - unit_th, unit_th); - else if (diff_cand < ref_cand * 9 / 10) - th =3D min(th + unit_th, ref_th * 2); - pgdat->nbp_th_nr_cand =3D nr_cand; - pgdat->nbp_threshold =3D th; - } -} - bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, int src_nid, int dst_cpu) { @@ -1901,33 +1781,11 @@ bool should_numa_migrate_memory(struct task_struct = *p, struct folio *folio, =20 /* * The pages in slow memory node should be migrated according - * to hot/cold instead of private/shared. 
- */ - if (folio_use_access_time(folio)) { - struct pglist_data *pgdat; - unsigned long rate_limit; - unsigned int latency, th, def_th; - long nr =3D folio_nr_pages(folio); - - pgdat =3D NODE_DATA(dst_nid); - if (pgdat_free_space_enough(pgdat)) { - /* workload changed, reset hot threshold */ - pgdat->nbp_threshold =3D 0; - mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr); - return true; - } - - def_th =3D sysctl_numa_balancing_hot_threshold; - rate_limit =3D MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit); - numa_promotion_adjust_threshold(pgdat, rate_limit, def_th); - - th =3D pgdat->nbp_threshold ? : def_th; - latency =3D numa_hint_fault_latency(folio); - if (latency >=3D th) - return false; - - return !numa_promotion_rate_limit(pgdat, rate_limit, nr); - } + * to hot/cold instead of private/shared. Also the migration + * of such pages are handled by kmigrated. + */ + if (folio_use_access_time(folio)) + return true; =20 this_cpupid =3D cpu_pid_to_cpupid(dst_cpu, current->pid); last_cpupid =3D folio_xchg_last_cpupid(folio, this_cpupid); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 40cf59301c21..f52587e70b3c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -40,6 +40,7 @@ #include #include #include +#include =20 #include #include "internal.h" @@ -2217,29 +2218,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *v= mf) =20 target_nid =3D numa_migrate_check(folio, vmf, haddr, &flags, writable, &last_cpupid); + nid =3D target_nid; if (target_nid =3D=3D NUMA_NO_NODE) goto out_map; - if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { - flags |=3D TNF_MIGRATE_FAIL; - goto out_map; - } - /* The folio is isolated and isolation code holds a folio reference. 
*/ - spin_unlock(vmf->ptl); - writable =3D false; =20 - if (!migrate_misplaced_folio(folio, target_nid)) { - flags |=3D TNF_MIGRATED; - nid =3D target_nid; - task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags); - return 0; - } + writable =3D false; =20 - flags |=3D TNF_MIGRATE_FAIL; - vmf->ptl =3D pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) { - spin_unlock(vmf->ptl); - return 0; - } out_map: /* Restore the PMD */ pmd =3D pmd_modify(pmdp_get(vmf->pmd), vma->vm_page_prot); @@ -2250,8 +2234,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vm= f) update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); spin_unlock(vmf->ptl); =20 - if (nid !=3D NUMA_NO_NODE) + if (nid !=3D NUMA_NO_NODE) { + pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies); task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags); + } return 0; } =20 diff --git a/mm/memory.c b/mm/memory.c index 2a55edc48a65..98a9a3b675a0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -75,6 +75,7 @@ #include #include #include +#include #include #include #include @@ -6046,34 +6047,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) =20 target_nid =3D numa_migrate_check(folio, vmf, vmf->address, &flags, writable, &last_cpupid); + nid =3D target_nid; if (target_nid =3D=3D NUMA_NO_NODE) goto out_map; - if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) { - flags |=3D TNF_MIGRATE_FAIL; - goto out_map; - } - /* The folio is isolated and isolation code holds a folio reference. 
*/ - pte_unmap_unlock(vmf->pte, vmf->ptl); + writable =3D false; ignore_writable =3D true; - - /* Migrate to the requested node */ - if (!migrate_misplaced_folio(folio, target_nid)) { - nid =3D target_nid; - flags |=3D TNF_MIGRATED; - task_numa_fault(last_cpupid, nid, nr_pages, flags); - return 0; - } - - flags |=3D TNF_MIGRATE_FAIL; - vmf->pte =3D pte_offset_map_lock(vma->vm_mm, vmf->pmd, - vmf->address, &vmf->ptl); - if (unlikely(!vmf->pte)) - return 0; - if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { - pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; - } out_map: /* * Make it present again, depending on how arch implements @@ -6087,8 +6066,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) writable); pte_unmap_unlock(vmf->pte, vmf->ptl); =20 - if (nid !=3D NUMA_NO_NODE) + if (nid !=3D NUMA_NO_NODE) { + pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies); task_numa_fault(last_cpupid, nid, nr_pages, flags); + } return 0; } =20 diff --git a/mm/pghot.c b/mm/pghot.c index bf1d9029cbaa..6fc76c1eaff8 100644 --- a/mm/pghot.c +++ b/mm/pghot.c @@ -17,6 +17,9 @@ * the hot pages. kmigrated runs for each lower tier node. It iterates * over the node's PFNs and migrates pages marked for migration into * their targeted nodes. + * + * Migration rate-limiting and dynamic threshold logic implementations + * were moved from NUMA Balancing mode 2. */ #include #include @@ -31,6 +34,12 @@ unsigned int kmigrated_batch_nr =3D KMIGRATED_DEFAULT_BA= TCH_NR; =20 unsigned int sysctl_pghot_freq_window =3D PGHOT_DEFAULT_FREQ_WINDOW; =20 +/* Restrict the NUMA promotion throughput (MB/s) for each target node. 
*/ +static unsigned int sysctl_pghot_promote_rate_limit =3D 65536; + +#define KMIGRATED_MIGRATION_ADJUST_STEPS 16 +#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW 60000 + DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints); DEFINE_STATIC_KEY_FALSE(pghot_src_pgtscans); DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults); @@ -45,6 +54,14 @@ static const struct ctl_table pghot_sysctls[] =3D { .proc_handler =3D proc_dointvec_minmax, .extra1 =3D SYSCTL_ZERO, }, + { + .procname =3D "pghot_promote_rate_limit_MBps", + .data =3D &sysctl_pghot_promote_rate_limit, + .maxlen =3D sizeof(unsigned int), + .mode =3D 0644, + .proc_handler =3D proc_dointvec_minmax, + .extra1 =3D SYSCTL_ZERO, + }, }; #endif =20 @@ -138,6 +155,110 @@ int pghot_record_access(unsigned long pfn, int nid, i= nt src, unsigned long now) return 0; } =20 +/* + * For memory tiering mode, if there are enough free pages (more than + * enough watermark defined here) in fast memory node, to take full + * advantage of fast memory capacity, all recently accessed slow + * memory pages will be migrated to fast memory node without + * considering hot threshold. + */ +static bool pgdat_free_space_enough(struct pglist_data *pgdat) +{ + int z; + unsigned long enough_wmark; + + enough_wmark =3D max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, + pgdat->node_present_pages >> 4); + for (z =3D pgdat->nr_zones - 1; z >=3D 0; z--) { + struct zone *zone =3D pgdat->node_zones + z; + + if (!populated_zone(zone)) + continue; + + if (zone_watermark_ok(zone, 0, + promo_wmark_pages(zone) + enough_wmark, + ZONE_MOVABLE, 0)) + return true; + } + return false; +} + +/* + * For memory tiering mode, too high promotion/demotion throughput may + * hurt application latency. So we provide a mechanism to rate limit + * the number of pages that are tried to be promoted. 
+ */ +static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsi= gned long rate_limit, + int nr, unsigned long now_ms) +{ + unsigned long nr_cand; + unsigned int start; + + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr); + nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); + start =3D pgdat->nbp_rl_start; + if (now_ms - start > MSEC_PER_SEC && + cmpxchg(&pgdat->nbp_rl_start, start, now_ms) =3D=3D start) + pgdat->nbp_rl_nr_cand =3D nr_cand; + if (nr_cand - pgdat->nbp_rl_nr_cand >=3D rate_limit) + return true; + return false; +} + +static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat, + unsigned long rate_limit, unsigned int ref_th, + unsigned long now_ms) +{ + unsigned int start, th_period, unit_th, th; + unsigned long nr_cand, ref_cand, diff_cand; + + th_period =3D KMIGRATED_PROMOTION_THRESHOLD_WINDOW; + start =3D pgdat->nbp_th_start; + if (now_ms - start > th_period && + cmpxchg(&pgdat->nbp_th_start, start, now_ms) =3D=3D start) { + ref_cand =3D rate_limit * + KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC; + nr_cand =3D node_page_state(pgdat, PGPROMOTE_CANDIDATE); + diff_cand =3D nr_cand - pgdat->nbp_th_nr_cand; + unit_th =3D ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS; + th =3D pgdat->nbp_threshold ? 
: ref_th; + if (diff_cand > ref_cand * 11 / 10) + th =3D max(th - unit_th, unit_th); + else if (diff_cand < ref_cand * 9 / 10) + th =3D min(th + unit_th, ref_th * 2); + pgdat->nbp_th_nr_cand =3D nr_cand; + pgdat->nbp_threshold =3D th; + } +} + +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int ni= d, + unsigned long time) +{ + struct pglist_data *pgdat; + unsigned long rate_limit; + unsigned int th, def_th; + unsigned long now_ms =3D jiffies_to_msecs(jiffies); /* Based on full-widt= h jiffies */ + unsigned long now =3D jiffies; + + pgdat =3D NODE_DATA(nid); + if (pgdat_free_space_enough(pgdat)) { + /* workload changed, reset hot threshold */ + pgdat->nbp_threshold =3D 0; + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages); + return true; + } + + def_th =3D sysctl_pghot_freq_window; + rate_limit =3D MB_TO_PAGES(sysctl_pghot_promote_rate_limit); + kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms); + + th =3D pgdat->nbp_threshold ? : def_th; + if (pghot_access_latency(time, now) >=3D th) + return false; + + return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_m= s); +} + static int pghot_get_hotness(unsigned long pfn, int *nid, int *freq, unsigned long *time) { @@ -197,6 +318,9 @@ static void kmigrated_walk_zone(unsigned long start_pfn= , unsigned long end_pfn, if (folio_nid(folio) =3D=3D nid) goto out_next; =20 + if (!kmigrated_should_migrate_memory(nr, nid, time)) + goto out_next; + if (migrate_misplaced_folio_prepare(folio, NULL, nid)) goto out_next; =20 --=20 2.34.1 From nobody Sat Feb 7 08:44:11 2026 Received: from PH8PR06CU001.outbound.protection.outlook.com (mail-westus3azon11012011.outbound.protection.outlook.com [40.107.209.11]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7C74030BF66 for ; Thu, 29 Jan 2026 14:44:22 +0000 (UTC) Authentication-Results: 
smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none)
header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C Received: from satlexmb07.amd.com (165.204.84.17) by BN2PEPF000055E0.mail.protection.outlook.com (10.167.245.10) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9587.0 via Frontend Transport; Thu, 29 Jan 2026 14:44:09 +0000 Received: from BLR-L-BHARARAO.amd.com (10.180.168.240) by satlexmb07.amd.com (10.181.42.216) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Thu, 29 Jan 2026 08:44:01 -0600 From: Bharata B Rao To: , CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , Bharata B Rao Subject: [RFC PATCH v5 06/10] x86: ibs: In-kernel IBS driver for memory access profiling Date: Thu, 29 Jan 2026 20:10:39 +0530 Message-ID: <20260129144043.231636-7-bharata@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260129144043.231636-1-bharata@amd.com> References: <20260129144043.231636-1-bharata@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: satlexmb08.amd.com (10.181.42.217) To satlexmb07.amd.com (10.181.42.216) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: BN2PEPF000055E0:EE_|PH7PR12MB5783:EE_ X-MS-Office365-Filtering-Correlation-Id: 4e3570a1-62ad-46a8-8cd5-08de5f44dc94 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|82310400026|376014|7416014|1800799024|36860700013|13003099007; X-Microsoft-Antispam-Message-Info: =?us-ascii?Q?h04WKilV/ZrpmKBCQM/CLVdvTvDyagINShho3s3MDOMjNkt3PJDE9PAaBAN3?= =?us-ascii?Q?3w50jg/yyuGNGNtMa4il9E2q/L0xzZueFa0nxIxocvZO6IRIeVAJsSAJXRFv?= 
 =?us-ascii?Q?KInIbwRk8Q5ZKWZeYljDp6fDgAA=3D?=
X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(82310400026)(376014)(7416014)(1800799024)(36860700013)(13003099007);DIR:OUT;SFP:1101;
X-OriginatorOrg: amd.com
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 29 Jan 2026 14:44:09.5899 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 4e3570a1-62ad-46a8-8cd5-08de5f44dc94
X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d
X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com]
X-MS-Exchange-CrossTenant-AuthSource: BN2PEPF000055E0.namprd21.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem
X-MS-Exchange-Transport-CrossTenantHeadersStamped: PH7PR12MB5783
Content-Type: text/plain; charset="utf-8"

Use the IBS (Instruction Based Sampling) feature present in AMD
processors for memory access tracking. The access information
obtained from IBS via NMI is fed to the pghot sub-system for
further action.

In addition to much other information related to the memory
access, IBS provides the physical (and virtual) address of the
access and indicates whether the access came from a slower tier.
Only memory accesses originating from slower tiers are further
acted upon by this driver.

The samples are initially accumulated in percpu buffers. The NMI
handler queues an irq_work, which schedules a work item that
flushes the samples to the pghot hot page tracking mechanism.

TODO: Many counters are added to vmstat just as a debugging aid
for now.

About IBS
---------
IBS can be programmed to provide data about instruction
execution periodically. This is done by programming a desired
sample count (number of ops) in a control register.
When the programmed number of ops has been dispatched, a micro-op
gets tagged, various information about the tagged micro-op's
execution is populated in the IBS execution MSRs, and an interrupt
is raised. While IBS provides a lot of data for each sample, for
the purpose of memory access profiling we are interested in the
linear and physical addresses of memory accesses that reached DRAM.
Recent AMD processors provide further filtering that limits the
sampling to those ops that had an L3 miss, which greatly reduces
the number of non-useful samples.

While IBS can sample both instruction fetch and execution, only
IBS execution sampling is used here to collect data about memory
accesses that occur during instruction execution.

More information about IBS is available in Sec 13.3 of the AMD64
Architecture Programmer's Manual, Volume 2: System Programming,
which is available at:

https://bugzilla.kernel.org/attachment.cgi?id=3D288923

Information about the MSRs used for programming IBS can be found
in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h Model 11h B1,
which is currently available at:

https://www.amd.com/system/files/TechDocs/55901_0.25.zip

Signed-off-by: Bharata B Rao
---
 arch/x86/events/amd/ibs.c        |  10 +
 arch/x86/include/asm/msr-index.h |  16 ++
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/ibs.c                | 317 +++++++++++++++++++++++++++++++
 include/linux/pghot.h            |   8 +
 include/linux/vm_event_item.h    |  19 ++
 mm/Kconfig                       |  13 ++
 mm/vmstat.c                      |  19 ++
 8 files changed, 403 insertions(+)
 create mode 100644 arch/x86/mm/ibs.c

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index aca89f23d2e0..dc544d084c17 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
=20
 #include
 #include
@@ -1760,6 +1761,15 @@ static __init int amd_ibs_init(void)
 {
 	u32 caps;
=20
+	/*
+	 * TODO: Find a clean way to disable perf IBS so that IBS
+	 * can be used for memory access profiling.
+ */ + if (hwmem_access_profiler_inuse()) { + pr_info("IBS isn't available for perf use\n"); + return 0; + } + caps =3D __get_ibs_caps(); if (!caps) return -ENODEV; /* ibs not supported by the cpu */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-in= dex.h index 3d0a0950d20a..3c5d69ec83a2 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -784,6 +784,22 @@ /* AMD Last Branch Record MSRs */ #define MSR_AMD64_LBR_SELECT 0xc000010e =20 +/* AMD IBS MSR bits */ +#define MSR_AMD64_IBSOPDATA2_DATASRC 0x7 +#define MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE 0x1 +#define MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR 0x2 +#define MSR_AMD64_IBSOPDATA2_DATASRC_DRAM 0x3 +#define MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE 0x5 +#define MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM 0x8 +#define MSR_AMD64_IBSOPDATA2_RMTNODE 0x10 + +#define MSR_AMD64_IBSOPDATA3_LDOP BIT_ULL(0) +#define MSR_AMD64_IBSOPDATA3_STOP BIT_ULL(1) +#define MSR_AMD64_IBSOPDATA3_DCMISS BIT_ULL(7) +#define MSR_AMD64_IBSOPDATA3_LADDR_VALID BIT_ULL(17) +#define MSR_AMD64_IBSOPDATA3_PADDR_VALID BIT_ULL(18) +#define MSR_AMD64_IBSOPDATA3_L2MISS BIT_ULL(20) + /* Zen4 */ #define MSR_ZEN4_BP_CFG 0xc001102e #define MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT 4 diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index 5b9908f13dcf..361a456582e9 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -57,3 +57,4 @@ obj-$(CONFIG_X86_MEM_ENCRYPT) +=3D mem_encrypt.o obj-$(CONFIG_AMD_MEM_ENCRYPT) +=3D mem_encrypt_amd.o =20 obj-$(CONFIG_AMD_MEM_ENCRYPT) +=3D mem_encrypt_boot.o +obj-$(CONFIG_HWMEM_PROFILER) +=3D ibs.o diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c new file mode 100644 index 000000000000..752f688375f9 --- /dev/null +++ b/arch/x86/mm/ibs.c @@ -0,0 +1,317 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include + +#include +#include /* TODO: Move defns like IBS_OP_ENABLE into no= n-perf header */ +#include + +bool 
hwmem_access_profiling; + +static u64 ibs_config __read_mostly; +static u32 ibs_caps; + +#define IBS_NR_SAMPLES 150 + +/* + * Basic access info captured for each memory access. + */ +struct ibs_sample { + unsigned long pfn; + unsigned long time; /* jiffies when accessed */ + int nid; /* Accessing node ID, if known */ +}; + +/* + * Percpu buffer of access samples. Samples are accumulated here + * before pushing them to pghot sub-system for further action. + */ +struct ibs_sample_pcpu { + struct ibs_sample samples[IBS_NR_SAMPLES]; + int head, tail; +}; + +struct ibs_sample_pcpu __percpu *ibs_s; + +/* + * The workqueue for pushing the percpu access samples to pghot sub-system. + */ +static struct work_struct ibs_work; +static struct irq_work ibs_irq_work; + +bool hwmem_access_profiler_inuse(void) +{ + return hwmem_access_profiling; +} + +/* + * Record the IBS-reported access sample in percpu buffer. + * Called from IBS NMI handler. + */ +static int ibs_push_sample(unsigned long pfn, int nid, unsigned long time) +{ + struct ibs_sample_pcpu *ibs_pcpu =3D raw_cpu_ptr(ibs_s); + int next =3D ibs_pcpu->head + 1; + + if (next >=3D IBS_NR_SAMPLES) + next =3D 0; + + if (next =3D=3D ibs_pcpu->tail) + return 0; + + ibs_pcpu->samples[ibs_pcpu->head].pfn =3D pfn; + ibs_pcpu->samples[ibs_pcpu->head].time =3D time; + ibs_pcpu->samples[ibs_pcpu->head].nid =3D nid; + ibs_pcpu->head =3D next; + return 1; +} + +static int ibs_pop_sample(struct ibs_sample *s) +{ + struct ibs_sample_pcpu *ibs_pcpu =3D raw_cpu_ptr(ibs_s); + + int next =3D ibs_pcpu->tail + 1; + + if (ibs_pcpu->head =3D=3D ibs_pcpu->tail) + return 0; + + if (next >=3D IBS_NR_SAMPLES) + next =3D 0; + + *s =3D ibs_pcpu->samples[ibs_pcpu->tail]; + ibs_pcpu->tail =3D next; + return 1; +} + +/* + * Remove access samples from percpu buffer and send them + * to pghot sub-system for further action. 
+ */ +static void ibs_work_handler(struct work_struct *work) +{ + struct ibs_sample s; + + while (ibs_pop_sample(&s)) + pghot_record_access(s.pfn, s.nid, PGHOT_HW_HINTS, s.time); +} + +static void ibs_irq_handler(struct irq_work *i) +{ + schedule_work_on(smp_processor_id(), &ibs_work); +} + +/* + * IBS NMI handler: Process the memory access info reported by IBS. + * + * Reads the MSRs to collect all the information about the reported + * memory access, validates the access, stores the valid sample and + * schedules the work on this CPU to further process the sample. + */ +static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs) +{ + struct mm_struct *mm =3D current->mm; + u64 ops_ctl, ops_data3, ops_data2; + u64 laddr =3D -1, paddr =3D -1; + u64 data_src, rmt_node; + struct page *page; + unsigned long pfn; + + rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl); + + /* + * When IBS sampling period is reprogrammed via read-modify-update + * of MSR_AMD64_IBSOPCTL, overflow NMIs could be generated with + * IBS_OP_ENABLE not set. For such cases, return as HANDLED. + * + * With this, the handler will say "handled" for all NMIs that + * aren't related to this NMI. This stems from the limitation of + * having both status and control bits in one MSR. + */ + if (!(ops_ctl & IBS_OP_VAL)) + goto handled; + + wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_VAL); + + count_vm_event(HWHINT_NR_EVENTS); + + if (!user_mode(regs)) { + count_vm_event(HWHINT_KERNEL); + goto handled; + } + + if (!mm) { + count_vm_event(HWHINT_KTHREAD); + goto handled; + } + + rdmsrl(MSR_AMD64_IBSOPDATA3, ops_data3); + + /* Load/Store ops only */ + /* TODO: DataSrc isn't valid for stores, so filter out stores? 
*/ + if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_LDOP | + MSR_AMD64_IBSOPDATA3_STOP))) { + count_vm_event(HWHINT_NON_LOAD_STORES); + goto handled; + } + + /* Discard the sample if it was L1 or L2 hit */ + if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_DCMISS | + MSR_AMD64_IBSOPDATA3_L2MISS))) { + count_vm_event(HWHINT_DC_L2_HITS); + goto handled; + } + + rdmsrl(MSR_AMD64_IBSOPDATA2, ops_data2); + data_src =3D ops_data2 & MSR_AMD64_IBSOPDATA2_DATASRC; + if (ibs_caps & IBS_CAPS_ZEN4) + data_src |=3D ((ops_data2 & 0xC0) >> 3); + + switch (data_src) { + case MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE: + count_vm_event(HWHINT_LOCAL_L3L1L2); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR: + count_vm_event(HWHINT_LOCAL_PEER_CACHE_NEAR); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_DRAM: + count_vm_event(HWHINT_DRAM_ACCESSES); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM: + count_vm_event(HWHINT_CXL_ACCESSES); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE: + count_vm_event(HWHINT_FAR_CACHE_HITS); + break; + } + + rmt_node =3D ops_data2 & MSR_AMD64_IBSOPDATA2_RMTNODE; + if (rmt_node) + count_vm_event(HWHINT_REMOTE_NODE); + + /* Is linear addr valid? */ + if (ops_data3 & MSR_AMD64_IBSOPDATA3_LADDR_VALID) + rdmsrl(MSR_AMD64_IBSDCLINAD, laddr); + else { + count_vm_event(HWHINT_LADDR_INVALID); + goto handled; + } + + /* Discard kernel address accesses */ + if (laddr & (1UL << 63)) { + count_vm_event(HWHINT_KERNEL_ADDR); + goto handled; + } + + /* Is phys addr valid? 
*/ + if (ops_data3 & MSR_AMD64_IBSOPDATA3_PADDR_VALID) + rdmsrl(MSR_AMD64_IBSDCPHYSAD, paddr); + else { + count_vm_event(HWHINT_PADDR_INVALID); + goto handled; + } + + pfn =3D PHYS_PFN(paddr); + page =3D pfn_to_online_page(pfn); + if (!page) + goto handled; + + if (!PageLRU(page)) { + count_vm_event(HWHINT_NON_LRU); + goto handled; + } + + if (!ibs_push_sample(pfn, numa_node_id(), jiffies)) { + count_vm_event(HWHINT_BUFFER_FULL); + goto handled; + } + + irq_work_queue(&ibs_irq_work); + count_vm_event(HWHINT_USEFUL_SAMPLES); + +handled: + return NMI_HANDLED; +} + +static inline int get_ibs_lvt_offset(void) +{ + u64 val; + + rdmsrl(MSR_AMD64_IBSCTL, val); + if (!(val & IBSCTL_LVT_OFFSET_VALID)) + return -EINVAL; + + return val & IBSCTL_LVT_OFFSET_MASK; +} + +static void setup_APIC_ibs(void) +{ + int offset; + + offset =3D get_ibs_lvt_offset(); + if (offset < 0) + goto failed; + + if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0)) + return; +failed: + pr_warn("IBS APIC setup failed on cpu #%d\n", + smp_processor_id()); +} + +static void clear_APIC_ibs(void) +{ + int offset; + + offset =3D get_ibs_lvt_offset(); + if (offset >=3D 0) + setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1); +} + +static int x86_amd_ibs_access_profile_startup(unsigned int cpu) +{ + setup_APIC_ibs(); + return 0; +} + +static int x86_amd_ibs_access_profile_teardown(unsigned int cpu) +{ + clear_APIC_ibs(); + return 0; +} + +static int __init ibs_access_profiling_init(void) +{ + if (!boot_cpu_has(X86_FEATURE_IBS)) { + pr_info("IBS capability is unavailable for access profiling\n"); + return 0; + } + + ibs_s =3D alloc_percpu_gfp(struct ibs_sample_pcpu, GFP_KERNEL | __GFP_ZER= O); + if (!ibs_s) + return 0; + + INIT_WORK(&ibs_work, ibs_work_handler); + init_irq_work(&ibs_irq_work, ibs_irq_handler); + + /* Uses IBS Op sampling */ + ibs_config =3D IBS_OP_CNT_CTL | IBS_OP_ENABLE; + ibs_caps =3D cpuid_eax(IBS_CPUID_FEATURES); + if (ibs_caps & IBS_CAPS_ZEN4) + ibs_config |=3D IBS_OP_L3MISSONLY; + + 
register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs"); + + cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING, + "x86/amd/ibs_access_profile:starting", + x86_amd_ibs_access_profile_startup, + x86_amd_ibs_access_profile_teardown); + + pr_info("IBS setup for memory access profiling\n"); + return 0; +} + +arch_initcall(ibs_access_profiling_init); diff --git a/include/linux/pghot.h b/include/linux/pghot.h index d3d59b0c0cf6..20ea9767dbdd 100644 --- a/include/linux/pghot.h +++ b/include/linux/pghot.h @@ -2,6 +2,14 @@ #ifndef _LINUX_PGHOT_H #define _LINUX_PGHOT_H =20 +#include + +#ifdef CONFIG_HWMEM_PROFILER +bool hwmem_access_profiler_inuse(void); +#else +static inline bool hwmem_access_profiler_inuse(void) { return false; } +#endif + /* Page hotness temperature sources */ enum pghot_src { PGHOT_HW_HINTS, diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 5b8fd93b55fd..67efbca9051c 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -193,6 +193,25 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGHOT_RECORD_HWHINTS, PGHOT_RECORD_PGTSCANS, PGHOT_RECORD_HINTFAULTS, +#ifdef CONFIG_HWMEM_PROFILER + HWHINT_NR_EVENTS, + HWHINT_KERNEL, + HWHINT_KTHREAD, + HWHINT_NON_LOAD_STORES, + HWHINT_DC_L2_HITS, + HWHINT_LOCAL_L3L1L2, + HWHINT_LOCAL_PEER_CACHE_NEAR, + HWHINT_FAR_CACHE_HITS, + HWHINT_DRAM_ACCESSES, + HWHINT_CXL_ACCESSES, + HWHINT_REMOTE_NODE, + HWHINT_LADDR_INVALID, + HWHINT_KERNEL_ADDR, + HWHINT_PADDR_INVALID, + HWHINT_NON_LRU, + HWHINT_BUFFER_FULL, + HWHINT_USEFUL_SAMPLES, +#endif /* CONFIG_HWMEM_PROFILER */ #endif /* CONFIG_PGHOT */ NR_VM_EVENT_ITEMS }; diff --git a/mm/Kconfig b/mm/Kconfig index fde5aee3e16f..07b16aece877 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1489,6 +1489,19 @@ config PGHOT_PRECISE 4 bytes per page against the default one byte per page. Preferable to enable this on systems with multiple nodes in toptier. 
=20
+config HWMEM_PROFILER
+	bool "HW based memory access profiling"
+	def_bool n
+	depends on PGHOT
+	depends on X86_64
+	help
+	  Some hardware platforms are capable of providing memory access
+	  information in a direct and actionable manner. Instruction Based
+	  Sampling (IBS), present on AMD Zen CPUs, is one such example.
+	  Memory accesses obtained via such HW based mechanisms are
+	  rolled up to the PGHOT sub-system for further action like hot
+	  page promotion or NUMA Balancing.
+
 source "mm/damon/Kconfig"
=20
 endmenu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f6f91b9dd887..62c47f44edf0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1506,6 +1506,25 @@ const char * const vmstat_text[] =3D {
 	[I(PGHOT_RECORD_HWHINTS)] =3D "pghot_recorded_hwhints",
 	[I(PGHOT_RECORD_PGTSCANS)] =3D "pghot_recorded_pgtscans",
 	[I(PGHOT_RECORD_HINTFAULTS)] =3D "pghot_recorded_hintfaults",
+#ifdef CONFIG_HWMEM_PROFILER
+	[I(HWHINT_NR_EVENTS)] =3D "hwhint_nr_events",
+	[I(HWHINT_KERNEL)] =3D "hwhint_kernel",
+	[I(HWHINT_KTHREAD)] =3D "hwhint_kthread",
+	[I(HWHINT_NON_LOAD_STORES)] =3D "hwhint_non_load_stores",
+	[I(HWHINT_DC_L2_HITS)] =3D "hwhint_dc_l2_hits",
+	[I(HWHINT_LOCAL_L3L1L2)] =3D "hwhint_local_l3l1l2",
+	[I(HWHINT_LOCAL_PEER_CACHE_NEAR)] =3D "hwhint_local_peer_cache_near",
+	[I(HWHINT_FAR_CACHE_HITS)] =3D "hwhint_far_cache_hits",
+	[I(HWHINT_DRAM_ACCESSES)] =3D "hwhint_dram_accesses",
+	[I(HWHINT_CXL_ACCESSES)] =3D "hwhint_cxl_accesses",
+	[I(HWHINT_REMOTE_NODE)] =3D "hwhint_remote_node",
+	[I(HWHINT_LADDR_INVALID)] =3D "hwhint_invalid_laddr",
+	[I(HWHINT_KERNEL_ADDR)] =3D "hwhint_kernel_addr",
+	[I(HWHINT_PADDR_INVALID)] =3D "hwhint_invalid_paddr",
+	[I(HWHINT_NON_LRU)] =3D "hwhint_non_lru",
+	[I(HWHINT_BUFFER_FULL)] =3D "hwhint_buffer_full",
+	[I(HWHINT_USEFUL_SAMPLES)] =3D "hwhint_useful_samples",
+#endif /* CONFIG_HWMEM_PROFILER */
 #endif /* CONFIG_PGHOT */
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
--=20
2.34.1

From nobody Sat Feb 7 08:44:11 2026
Received: from
BN2PEPF000055DC.namprd21.prod.outlook.com (2603:10b6:208:2c0:cafe::8d) by BL1PR13CA0367.outlook.office365.com (2603:10b6:208:2c0::12) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9587.2 via Frontend Transport; Thu, 29 Jan 2026 14:44:27 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C Received: from satlexmb07.amd.com (165.204.84.17) by BN2PEPF000055DC.mail.protection.outlook.com (10.167.245.6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9587.0 via Frontend Transport; Thu, 29 Jan 2026 14:44:35 +0000 Received: from BLR-L-BHARARAO.amd.com (10.180.168.240) by satlexmb07.amd.com (10.181.42.216) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Thu, 29 Jan 2026 08:44:27 -0600 From: Bharata B Rao To: , CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , Bharata B Rao Subject: [RFC PATCH v5 07/10] x86: ibs: Enable IBS profiling for memory accesses Date: Thu, 29 Jan 2026 20:10:40 +0530 Message-ID: <20260129144043.231636-8-bharata@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260129144043.231636-1-bharata@amd.com> References: <20260129144043.231636-1-bharata@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: satlexmb08.amd.com (10.181.42.217) To satlexmb07.amd.com (10.181.42.216) X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: BN2PEPF000055DC:EE_|CY3PR12MB9556:EE_ 
X-MS-Office365-Filtering-Correlation-Id: 3ede1100-8178-45d7-a409-08de5f44ec02 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|36860700013|1800799024|376014|7416014|82310400026; X-Microsoft-Antispam-Message-Info: =?us-ascii?Q?/ytA+7EjjwUGJxmAfncte2N8PLj2VpJ/O8x8Nb2aeuE++8Vjecnx0+VPg2vV?= =?us-ascii?Q?d9EePBdiHk1Zs5F5ua67rNV0qEdLzI4DjOzBW/2h1Hvq11hEWT4EuNvt8M/G?= =?us-ascii?Q?Aw6NANIr/7XA/POa8OtQPR3XW9eWCSi7WC6Tg/vfAiZ6gQ2at4HF1UdO0WJ1?= =?us-ascii?Q?af8VRtHM0MOz870HqhrR1VpvIGf7/aV/m5EvKPBrDgbpmy77Mb4uvFbvCcsF?= =?us-ascii?Q?dk4QElNdCVvfqUcrzGudc97tvzZs0Ifx90oWhFH7+v/fBh/LXWZkGgnUFZnW?= =?us-ascii?Q?ZgiveR7WawMz8t2ME1aUS5S1G9lCi6KkyMQPyu+/RTpo6eEpium4kL7GlHzq?= =?us-ascii?Q?v77goYeVuQoLLXb2NFI4PBX9ezkntpD2FEFJSpO+gPV/93KS3qmaAWaBa/nQ?= =?us-ascii?Q?lw/dmih0LkHEwE2yZat2AAXohEoHnwjgn01VzY/eIez501/08I9K1Enpig2X?= =?us-ascii?Q?xWA+wfX4el9mcntD+/KeVnmD8ZfcHEX7ap5CJKo+VNwyiT71nRpP4LAGzH5Q?= =?us-ascii?Q?lu0YTEmWcTuHvedvSEnBfJZf1gRdaaOr/nN3HcGtcuCFF6UMGHlZbLTmDYdx?= =?us-ascii?Q?29lidOZu4l7A3/u2lzkH7s0KHRbHSnRMBimYLMLPBhHNPxE6ddAeWN3Ptl5i?= =?us-ascii?Q?hNN3xNk0P5niO6v8rbkUDvbWZERJ+ppXvnyd15cgkeTS66Zwy2PKouimqdk6?= =?us-ascii?Q?LcPQygqufR5UIl650xHbjpxasmlRMW1+wye/Cvhuuy16XZ3Vggx/SFNd6Kr/?= =?us-ascii?Q?1o1zZIFchMw3IifiuRVRJnDCWhqkqY8q7UdAezgddLtVVk3KZKSZM5wUzTn7?= =?us-ascii?Q?M9QR/VHuoELt4fwnWD2aVKstv1He7X9phppTfv2p3gzj/5WXQn7NGhII/Sxs?= =?us-ascii?Q?3FErlMCZmu+CJUfjA2aNJfW5Zdo+UJFlEkYc5+XawlLquxi7nEoyZNLG1tTv?= =?us-ascii?Q?DBa7uw6YI3AAhRewQ9qsFpX/63ZKOzjZKUCZnfkiN6bpoa5STisflehW5Hcc?= =?us-ascii?Q?a5wM9PMnX5deSoZ9PgONRUM9VnXueurMJjgg2bU7o2FRgbZ00eEsPzvtABWS?= =?us-ascii?Q?uyyJ48uZPSSHYOXLUdweXBIEBMF9PJ6B37CAhyi3df7hCJHoF+WfxaAijUrr?= =?us-ascii?Q?iEqdek9dvDv8s3xYkkfZjmbWUCv7i3Deea+tfw4sY+lGzQM7EoQvmtpZ/+T+?= =?us-ascii?Q?3qgkcDC4crrV99x/tX4uVwZfnbaPFzYsPukJHvRDRB1XpG98R0Iho1pdGxXT?= =?us-ascii?Q?x89vjWEjRwV9VuR7Ii12/V1k3zIWHUdSGNmXzKv5UNlJJg8JwOAvqs3obHsx?= 
=?us-ascii?Q?BDzlJosqv7n7DW9Abpab3je3/YUBLvhQ6gJEZA8ZQzV1Hk+OTfoEhTHYYPUt?= =?us-ascii?Q?1CctduQCgUN2ZY6g/pPkWLI22cbhvluY+nXB835bkm9KUFg0Qje09Xbo2EUg?= =?us-ascii?Q?Glacfz5LBp1Fa1lwTL/LV8aLA6PwNHuGj6F++/WULKQMbUQUgysGKBHWvWjo?= =?us-ascii?Q?YIdj7c/yLAWXiMljhW1lg6V2WKlXqdYgSDJv5dfpxp7GyzBA4GbAxaeTnBpw?= =?us-ascii?Q?Q/aavlAdzLNDQ+YT31giYI/pFqmGev/coXBi33W1YOhE86jOPCnxhU0vJ0I9?= =?us-ascii?Q?CRlHTKYB3AMtIdioSWmH1THLnGnbaDeRTz7+r/yC1HlmoSnHgyUgVgkMqe1i?= =?us-ascii?Q?gc3vjA=3D=3D?= X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(36860700013)(1800799024)(376014)(7416014)(82310400026);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 29 Jan 2026 14:44:35.4772 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: 3ede1100-8178-45d7-a409-08de5f44ec02 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com] X-MS-Exchange-CrossTenant-AuthSource: BN2PEPF000055DC.namprd21.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: CY3PR12MB9556 Content-Type: text/plain; charset="utf-8" Enable IBS memory access data collection for user memory accesses by programming the required MSRs. The profiling is turned ON only for user mode execution and turned OFF for kernel mode execution. Profiling is explicitly disabled for NMI handler too. TODOs: - IBS sampling rate is kept fixed for now. - Arch/vendor separation/isolation of the code needs relook. 
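
For readers unfamiliar with the MSR layout, the following userspace sketch
shows how a fixed sample period might be folded into the IBS op control
value, using the same arithmetic as hwmem_access_profiling_start() in this
patch. The mask values are assumptions that should be checked against
arch/x86/include/asm/perf_event.h, and the helper names are hypothetical;
this is an illustration, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask values; verify against arch/x86/include/asm/perf_event.h. */
#define IBS_OP_MAX_CNT          0x0000FFFFULL   /* ctl bits 15:0 carry period bits 19:4 */
#define IBS_OP_MAX_CNT_EXT_MASK (0x7FULL << 20) /* ctl bits 26:20 carry period bits 26:20 */
#define IBS_OP_ENABLE           (1ULL << 17)

/*
 * Hypothetical helper: build the control value from a sample period and
 * the extra configuration bits (ibs_config in the patch). The low four
 * bits of the period are not programmable and are dropped.
 */
static uint64_t ibs_op_ctl_from_period(unsigned int period, uint64_t ibs_config)
{
	uint64_t config;

	assert(period < (1u << 27));	/* the period is at most 27 bits wide */

	config = (period >> 4) & IBS_OP_MAX_CNT;
	config |= period & IBS_OP_MAX_CNT_EXT_MASK;
	config |= ibs_config;
	return config;
}

/* Hypothetical inverse: recover the effective period from a control value. */
static unsigned int ibs_period_from_op_ctl(uint64_t ctl)
{
	return (unsigned int)(((ctl & IBS_OP_MAX_CNT) << 4) |
			      (ctl & IBS_OP_MAX_CNT_EXT_MASK));
}
```

With the IBS_SAMPLE_PERIOD of 10000 used in this patch, the period field
works out to 10000 >> 4 = 625 (0x271), and because 10000 is a multiple of
16 no precision is lost when the low bits are dropped.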

Signed-off-by: Bharata B Rao
---
 arch/x86/include/asm/entry-common.h |  3 +++
 arch/x86/include/asm/hardirq.h      |  2 ++
 arch/x86/mm/ibs.c                   | 32 +++++++++++++++++++++++++++++
 include/linux/pghot.h               |  4 ++++
 4 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce3eb6d5fdf9..0f381a63669e 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -4,6 +4,7 @@
 
 #include
 #include
+#include
 
 #include
 #include
@@ -13,6 +14,7 @@
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
 {
+	hwmem_access_profiling_stop();
 	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
 		/*
 		 * Make sure that the entry code gave us a sensible EFLAGS
@@ -106,6 +108,7 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 static __always_inline void arch_exit_to_user_mode(void)
 {
 	amd_clear_divider();
+	hwmem_access_profiling_start();
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 6b6d472baa0b..e80c305c17d1 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -91,4 +91,6 @@ static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void)
 static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
 #endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
 
+#define arch_nmi_enter() hwmem_access_profiling_stop()
+#define arch_nmi_exit() hwmem_access_profiling_start()
 #endif /* _ASM_X86_HARDIRQ_H */
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
index 752f688375f9..d0d93f09432d 100644
--- a/arch/x86/mm/ibs.c
+++ b/arch/x86/mm/ibs.c
@@ -16,6 +16,7 @@ static u64 ibs_config __read_mostly;
 static u32 ibs_caps;
 
 #define IBS_NR_SAMPLES 150
+#define IBS_SAMPLE_PERIOD 10000
 
 /*
  * Basic access info captured for each memory access.
@@ -43,6 +44,36 @@ struct ibs_sample_pcpu __percpu *ibs_s;
 static struct work_struct ibs_work;
 static struct irq_work ibs_irq_work;
 
+void hwmem_access_profiling_stop(void)
+{
+	u64 ops_ctl;
+
+	if (!hwmem_access_profiling)
+		return;
+
+	rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+	wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_ENABLE);
+}
+
+void hwmem_access_profiling_start(void)
+{
+	u64 config = 0;
+	unsigned int period = IBS_SAMPLE_PERIOD;
+
+	if (!hwmem_access_profiling)
+		return;
+
+	/* Disable IBS for kernel thread */
+	if (!current->mm)
+		goto out;
+
+	config = (period >> 4) & IBS_OP_MAX_CNT;
+	config |= (period & IBS_OP_MAX_CNT_EXT_MASK);
+	config |= ibs_config;
+out:
+	wrmsrl(MSR_AMD64_IBSOPCTL, config);
+}
+
 bool hwmem_access_profiler_inuse(void)
 {
 	return hwmem_access_profiling;
@@ -310,6 +341,7 @@ static int __init ibs_access_profiling_init(void)
 			  x86_amd_ibs_access_profile_startup,
 			  x86_amd_ibs_access_profile_teardown);
 
+	hwmem_access_profiling = true;
 	pr_info("IBS setup for memory access profiling\n");
 	return 0;
 }
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 20ea9767dbdd..603791183102 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -6,8 +6,12 @@
 
 #ifdef CONFIG_HWMEM_PROFILER
 bool hwmem_access_profiler_inuse(void);
+void hwmem_access_profiling_start(void);
+void hwmem_access_profiling_stop(void);
 #else
 static inline bool hwmem_access_profiler_inuse(void) { return false; }
+static inline void hwmem_access_profiling_start(void) {}
+static inline void hwmem_access_profiling_stop(void) {}
 #endif
 
 /* Page hotness temperature sources */
-- 
2.34.1

From nobody Sat Feb 7 08:44:11 2026
Received: from CH1PR05CU001.outbound.protection.outlook.com (mail-northcentralusazon11010007.outbound.protection.outlook.com [52.101.193.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 50BA531DD98 for ; Thu, 29
Jan 2026 14:45:19 +0000 (UTC)
From: Bharata B Rao
Subject: [RFC PATCH v5 08/10] mm: mglru: generalize page table walk
Date: Thu, 29 Jan 2026 20:10:41 +0530
Message-ID: <20260129144043.231636-9-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260129144043.231636-1-bharata@amd.com>
References: <20260129144043.231636-1-bharata@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kinsey Ho

Refactor the existing MGLRU page table walking logic to make it
resumable.

Additionally, introduce two hooks into the MGLRU page table walk:
an accessed callback and a flush callback. The accessed callback is
called for each accessed page detected via the scanned accessed bit.
The flush callback is called when the accessed callback reports that
a flush is required. This allows for processing pages in batches for
efficiency.

With a generalised page table walk, introduce a new scan function
which repeatedly scans the same young generation and does not add
a new young generation.
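
The resume-and-flush pattern described above can be sketched in plain
userspace C. This is only an illustration under simplifying assumptions:
the "page table" is a flat array of accessed bits, and struct walk,
record_access(), flush_batch(), try_walk() and scan() are hypothetical
stand-ins for struct lru_gen_mm_walk, the two callbacks, try_walk_mm()
and lru_gen_scan_lruvec() respectively.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH 4

/* Hypothetical stand-in for struct lru_gen_mm_walk. */
struct walk {
	size_t next_addr;			/* where the next pass resumes */
	bool (*accessed_cb)(unsigned long);	/* true means a flush is needed */
};

static unsigned long batch[BATCH];
static size_t batched, flushed, nr_flushes;

/* Accessed callback: queue the entry and ask for a flush once the batch is full. */
static bool record_access(unsigned long pfn)
{
	batch[batched++] = pfn;
	return batched == BATCH;
}

/* Flush callback: consume whatever has been queued so far. */
static void flush_batch(void)
{
	if (!batched)
		return;
	flushed += batched;
	batched = 0;
	nr_flushes++;
}

/* One pass over the entries; -EAGAIN means "stopped early, call again". */
static int try_walk(struct walk *w, const bool *young, size_t n)
{
	for (size_t i = w->next_addr; i < n; i++) {
		if (!young[i])
			continue;
		if (w->accessed_cb(i)) {
			w->next_addr = i + 1;	/* remember where to resume */
			return -EAGAIN;
		}
	}
	w->next_addr = n;
	return 0;
}

/* Driver loop: walk, flush, and resume until the walk completes. */
static size_t scan(const bool *young, size_t n)
{
	struct walk w = { .next_addr = 0, .accessed_cb = record_access };
	int err;

	batched = flushed = nr_flushes = 0;
	do {
		err = try_walk(&w, young, n);
		flush_batch();
	} while (err == -EAGAIN);

	return flushed;
}
```

As in the patch, the flush hook runs after every pass rather than only on
-EAGAIN, so a partially filled batch left over at the end of the walk is
still processed.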

Signed-off-by: Kinsey Ho
Signed-off-by: Yuanchu Xie
Signed-off-by: Bharata B Rao
---
 include/linux/mmzone.h |   5 ++
 mm/internal.h          |   4 +
 mm/vmscan.c            | 181 +++++++++++++++++++++++++++++++----------
 3 files changed, 145 insertions(+), 45 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 49c374064fc2..26350a4951ff 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -548,6 +548,8 @@ struct lru_gen_mm_walk {
 	unsigned long seq;
 	/* the next address within an mm to scan */
 	unsigned long next_addr;
+	/* called for each accessed pte/pmd */
+	bool (*accessed_cb)(unsigned long pfn);
 	/* to batch promoted pages */
 	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* to batch the mm stats */
@@ -555,6 +557,9 @@ struct lru_gen_mm_walk {
 	/* total batched items */
 	int batched;
 	int swappiness;
+	/* for the pmd under scanning */
+	int nr_young_pte;
+	int nr_total_pte;
 	bool force_scan;
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index e430da900430..426db1ae286f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,6 +538,10 @@ extern unsigned long highest_memmap_pfn;
 bool folio_isolate_lru(struct folio *folio);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+void set_task_reclaim_state(struct task_struct *task,
+			    struct reclaim_state *rs);
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+			 bool (*accessed_cb)(unsigned long), void (*flush_cb)(void));
 #ifdef CONFIG_NUMA
 int user_proactive_reclaim(char *buf,
 			   struct mem_cgroup *memcg, pg_data_t *pgdat);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..02f3dd128638 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -289,7 +289,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
 			continue; \
 		else
 
-static void set_task_reclaim_state(struct task_struct *task,
+void set_task_reclaim_state(struct task_struct *task,
 				   struct reclaim_state *rs)
 {
 	/* Check for an overwrite */
@@ -3058,7 +3058,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
 
 	VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->seq);
 
-	if (walk->seq <= mm_state->seq)
+	if (!walk->accessed_cb && walk->seq <= mm_state->seq)
 		goto done;
 
 	if (!mm_state->head)
@@ -3484,16 +3484,14 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
 	}
 }
 
-static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
-			   struct mm_walk *args)
+static int walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			  struct mm_walk *args, bool *suitable)
 {
 	int i;
 	bool dirty;
 	pte_t *pte;
 	spinlock_t *ptl;
 	unsigned long addr;
-	int total = 0;
-	int young = 0;
 	struct folio *last = NULL;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3501,19 +3499,24 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int gen = lru_gen_from_seq(max_seq);
 	pmd_t pmdval;
+	int err = 0;
 
 	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
-	if (!pte)
-		return false;
+	if (!pte) {
+		*suitable = false;
+		return err;
+	}
 
 	if (!spin_trylock(ptl)) {
 		pte_unmap(pte);
-		return true;
+		*suitable = true;
+		return err;
 	}
 
 	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
 		pte_unmap_unlock(pte, ptl);
-		return false;
+		*suitable = false;
+		return err;
 	}
 
 	arch_enter_lazy_mmu_mode();
@@ -3522,8 +3525,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		unsigned long pfn;
 		struct folio *folio;
 		pte_t ptent = ptep_get(pte + i);
+		bool do_flush;
 
-		total++;
+		walk->nr_total_pte++;
 		walk->mm_stats[MM_LEAF_TOTAL]++;
 
 		pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
@@ -3547,23 +3551,36 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		if (pte_dirty(ptent))
 			dirty = true;
 
-		young++;
+		walk->nr_young_pte++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
+
+		if (!walk->accessed_cb)
+			continue;
+
+		do_flush = walk->accessed_cb(pfn);
+		if (do_flush) {
+			walk->next_addr = addr + PAGE_SIZE;
+
+			err = -EAGAIN;
+			break;
+		}
 	}
 
 	walk_update_folio(walk, last, gen, dirty);
 	last = NULL;
 
-	if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+	if (!err && i < PTRS_PER_PTE &&
+	    get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
 		goto restart;
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte, ptl);
 
-	return suitable_to_scan(total, young);
+	*suitable = suitable_to_scan(walk->nr_total_pte, walk->nr_young_pte);
+	return err;
 }
 
-static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
+static int walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
 				  struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
 {
 	int i;
@@ -3576,6 +3593,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int gen = lru_gen_from_seq(max_seq);
+	int err = 0;
 
 	VM_WARN_ON_ONCE(pud_leaf(*pud));
 
@@ -3583,13 +3601,13 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	if (*first == -1) {
 		*first = addr;
 		bitmap_zero(bitmap, MIN_LRU_BATCH);
-		return;
+		return err;
 	}
 
 	i = addr == -1 ? 0 : pmd_index(addr) - pmd_index(*first);
 	if (i && i <= MIN_LRU_BATCH) {
 		__set_bit(i - 1, bitmap);
-		return;
+		return err;
 	}
 
 	pmd = pmd_offset(pud, *first);
@@ -3603,6 +3621,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	do {
 		unsigned long pfn;
 		struct folio *folio;
+		bool do_flush;
 
 		/* don't round down the first address */
 		addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;
@@ -3639,6 +3658,17 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 			dirty = true;
 
 		walk->mm_stats[MM_LEAF_YOUNG]++;
+		if (!walk->accessed_cb)
+			goto next;
+
+		do_flush = walk->accessed_cb(pfn);
+		if (do_flush) {
+			i = find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
+
+			walk->next_addr = (*first & PMD_MASK) + i * PMD_SIZE;
+			err = -EAGAIN;
+			break;
+		}
 next:
 		i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
 	} while (i <= MIN_LRU_BATCH);
@@ -3649,9 +3679,10 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	spin_unlock(ptl);
 done:
 	*first = -1;
+	return err;
 }
 
-static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+static int walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			   struct mm_walk *args)
 {
 	int i;
@@ -3663,6 +3694,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 	unsigned long first = -1;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
+	int err = 0;
 
 	VM_WARN_ON_ONCE(pud_leaf(*pud));
 
@@ -3676,6 +3708,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 	/* walk_pte_range() may call get_next_vma() */
 	vma = args->vma;
 	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		bool suitable;
 		pmd_t val = pmdp_get_lockless(pmd + i);
 
 		next = pmd_addr_end(addr, end);
@@ -3692,7 +3725,10 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			walk->mm_stats[MM_LEAF_TOTAL]++;
 
 			if (pfn != -1)
-				walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+				err = walk_pmd_range_locked(pud, addr, vma, args,
+						bitmap, &first);
+			if (err)
+				return err;
 			continue;
 		}
 
@@ -3701,33 +3737,51 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			if (!pmd_young(val))
 				continue;
 
-			walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+			err = walk_pmd_range_locked(pud, addr, vma, args,
+						bitmap, &first);
+			if (err)
+				return err;
 		}
 
 		if (!walk->force_scan && !test_bloom_filter(mm_state, walk->seq, pmd + i))
 			continue;
 
+		err = walk_pte_range(&val, addr, next, args, &suitable);
+		if (err && walk->next_addr < next && first == -1)
+			return err;
+
+		walk->nr_total_pte = 0;
+		walk->nr_young_pte = 0;
+
 		walk->mm_stats[MM_NONLEAF_FOUND]++;
 
-		if (!walk_pte_range(&val, addr, next, args))
-			continue;
+		if (!suitable)
+			goto next;
 
 		walk->mm_stats[MM_NONLEAF_ADDED]++;
 
 		/* carry over to the next generation */
 		update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
+next:
+		if (err) {
+			walk->next_addr = first;
+			return err;
+		}
 	}
 
-	walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
+	err = walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
 
-	if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+	if (!err && i < PTRS_PER_PMD &&
+	    get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
 		goto restart;
+
+	return err;
 }
 
 static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 			  struct mm_walk *args)
 {
-	int i;
+	int i, err;
 	pud_t *pud;
 	unsigned long addr;
 	unsigned long next;
@@ -3745,7 +3799,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
 			continue;
 
-		walk_pmd_range(&val, addr, next, args);
+		err = walk_pmd_range(&val, addr, next, args);
+		if (err)
+			return err;
 
 		if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
 			end = (addr | ~PUD_MASK) + 1;
@@ -3766,40 +3822,48 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 	return -EAGAIN;
 }
 
-static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+static int try_walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
 {
+	int err;
 	static const struct mm_walk_ops mm_walk_ops = {
 		.test_walk = should_skip_vma,
 		.p4d_entry = walk_pud_range,
 		.walk_lock = PGWALK_RDLOCK,
 	};
-	int err;
 	struct lruvec *lruvec = walk->lruvec;
 
-	walk->next_addr = FIRST_USER_ADDRESS;
+	DEFINE_MAX_SEQ(lruvec);
 
-	do {
-		DEFINE_MAX_SEQ(lruvec);
+	err = -EBUSY;
 
-		err = -EBUSY;
+	/* another thread might have called inc_max_seq() */
+	if (walk->seq != max_seq)
+		return err;
 
-		/* another thread might have called inc_max_seq() */
-		if (walk->seq != max_seq)
-			break;
+	/* the caller might be holding the lock for write */
+	if (mmap_read_trylock(mm)) {
+		err = walk_page_range(mm, walk->next_addr, ULONG_MAX,
+				      &mm_walk_ops, walk);
 
-		/* the caller might be holding the lock for write */
-		if (mmap_read_trylock(mm)) {
-			err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+		mmap_read_unlock(mm);
+	}
 
-			mmap_read_unlock(mm);
-		}
+	if (walk->batched) {
+		spin_lock_irq(&lruvec->lru_lock);
+		reset_batch_size(walk);
+		spin_unlock_irq(&lruvec->lru_lock);
+	}
 
-		if (walk->batched) {
-			spin_lock_irq(&lruvec->lru_lock);
-			reset_batch_size(walk);
-			spin_unlock_irq(&lruvec->lru_lock);
-		}
+	return err;
+}
+
+static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	int err;
 
+	walk->next_addr = FIRST_USER_ADDRESS;
+	do {
+		err = try_walk_mm(mm, walk);
 		cond_resched();
 	} while (err == -EAGAIN);
 }
@@ -4011,6 +4075,33 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
 	return success;
 }
 
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+			 bool (*accessed_cb)(unsigned long), void (*flush_cb)(void))
+{
+	struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+	struct mm_struct *mm = NULL;
+
+	walk->lruvec = lruvec;
+	walk->seq = seq;
+	walk->accessed_cb = accessed_cb;
+	walk->swappiness = MAX_SWAPPINESS;
+
+	do {
+		int err = -EBUSY;
+
+		iterate_mm_list(walk, &mm);
+		if (!mm)
+			break;
+
+		walk->next_addr = FIRST_USER_ADDRESS;
+		do {
+			err = try_walk_mm(mm, walk);
+			cond_resched();
+			flush_cb();
+		} while (err == -EAGAIN);
+	} while (mm);
+}
+
 static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 			       int swappiness, bool force_scan)
 {
-- 
2.34.1

From nobody Sat Feb 7 08:44:11 2026
Received: from SJ2PR03CU001.outbound.protection.outlook.com (mail-westusazon11012062.outbound.protection.outlook.com [52.101.43.62]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3E2AF31DD98 for ; Thu, 29 Jan 2026 14:45:34 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
dkim=pass (1024-bit key) header.d=amd.com header.i=@amd.com header.b="cnoy0QGr" ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=rLkqCZAnEZN8p77QOam2PsopKyX2aZcC9C2Df8QUBPXxqsRW0w8fLS4uhh4TglRUQo4As2Pmo127IwQ1OhcavF32WpoyKeCoyxcCbxtDce3xox1Fu1ZrCmwTMSLlem+4QSfn0q7gdW1YgLhk7BebpoUg4F4VjB9yAWWl/mtXQTK5HsJO0TXFqcMXLalov7g6OdTIHbHr1yx/al2tqdsJGUZKOgoWFpE+I3iQ5AZZT4JjBt9N9Sd7EHZw0vJK8deFM/lH/Mkptrk1g5AQrGn2wYNj6pfeSgwz2pY2DnK2+Pf3AXZpgOaIHgYaoQ7UZHup6hk6sr0PJFlZWlUZ59oArQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=jrhZvi5lUR5xhpRCQztNG0dMlnwfthGgJ82DPdnCgXY=; b=a4v7kW4b+56pLytCFDIYg8H+2EaN4w74Zj4TdXubKwOxMEQkSpK1Ks3YtOrSf2l7VxbZ28nx2LNfGY1sg64JoVNhC/kV/0Sc6pQQL2yPQsiB4uSEAyZAWMOzTeAPmss+tQyeP2DriASWBB8XuMc9eqaJwSktwtVm4u/E2/bSimKq561a62n9oROgeCPQQgImoEV10BcHTcyzmXznM6YV0lFHQ0gwjvxQooEH6YndeoLy1khlEJbEGZeHTqe0QdtkCI3gFlgYR7ycMf0o+RvanNBjyXwAyEjcOsHh43IUNfXilsqFpCmV0eMTSIDizvz3mqRPAZ66AlLUg49Erjy9eQ== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=vger.kernel.org smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amd.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=jrhZvi5lUR5xhpRCQztNG0dMlnwfthGgJ82DPdnCgXY=; b=cnoy0QGr7m7gin8mmUwez4aywKuD1RAFPv21dp8l2yl/+Zlz4akhmRsWlqE0gA7cSJVDDwQFe13+0ntmO6PllSe4R4yryBry3qctug1MA4KTusFGi2ImGWIYRUcAcQbMBWZJ1m0lYbLE1Ppb1VXy12g9wgM1blzOsJMjOm06+YE= Received: from SJ0PR05CA0181.namprd05.prod.outlook.com (2603:10b6:a03:330::6) by CH2PR12MB4136.namprd12.prod.outlook.com (2603:10b6:610:a4::14) with Microsoft SMTP 
Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9564.11; Thu, 29 Jan 2026 14:45:30 +0000 Received: from SJ5PEPF000001C9.namprd05.prod.outlook.com (2603:10b6:a03:330:cafe::ef) by SJ0PR05CA0181.outlook.office365.com (2603:10b6:a03:330::6) with Microsoft SMTP Server (version=TLS1_3, cipher=TLS_AES_256_GCM_SHA384) id 15.20.9564.7 via Frontend Transport; Thu, 29 Jan 2026 14:45:32 +0000 X-MS-Exchange-Authentication-Results: spf=pass (sender IP is 165.204.84.17) smtp.mailfrom=amd.com; dkim=none (message not signed) header.d=none;dmarc=pass action=none header.from=amd.com; Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=satlexmb07.amd.com; pr=C Received: from satlexmb07.amd.com (165.204.84.17) by SJ5PEPF000001C9.mail.protection.outlook.com (10.167.242.37) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.9564.3 via Frontend Transport; Thu, 29 Jan 2026 14:45:29 +0000 Received: from BLR-L-BHARARAO.amd.com (10.180.168.240) by satlexmb07.amd.com (10.181.42.216) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.2562.17; Thu, 29 Jan 2026 08:45:22 -0600 From: Bharata B Rao To: , CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , Bharata B Rao Subject: [RFC PATCH v5 09/10] mm: klruscand: use mglru scanning for page promotion Date: Thu, 29 Jan 2026 20:10:42 +0530 Message-ID: <20260129144043.231636-10-bharata@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260129144043.231636-1-bharata@amd.com> References: <20260129144043.231636-1-bharata@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: satlexmb08.amd.com (10.181.42.217) To satlexmb07.amd.com (10.181.42.216) 
X-EOPAttributedMessage: 0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: SJ5PEPF000001C9:EE_|CH2PR12MB4136:EE_ X-MS-Office365-Filtering-Correlation-Id: a61017e7-f960-49fb-d3e5-08de5f450c85 X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|7416014|82310400026|376014|1800799024|36860700013; X-Microsoft-Antispam-Message-Info: =?us-ascii?Q?cwnqM4I2Oo0fkFmju+yD9koMiT999G1yqkBqVE5OSyxdO7ECR9KuQZVade4W?= =?us-ascii?Q?KDF3OIX/wsJ2inA8rSZqJglVE4sAS/YYJhb3BYiKOKoIMVO6BXV9qTpxxnxs?= =?us-ascii?Q?44nqAiltDYOeZYn/2qo/4A2LcTfwnF1lxDiMD8Nr/ChVKBQIA3oI0l5sqVo3?= =?us-ascii?Q?VfnskA1EPTNG7aFLIb87DnT71HZ/doOpynubE4beOyEQXqc2S+lHt1QTgTlF?= =?us-ascii?Q?s0INPbOcv0y6QJNHP27HiZM9hEl8DrGZicKuIRe38MV7JnHovujIvrvqP47H?= =?us-ascii?Q?3K5L533zJSkDU1HoZBCri8xLDDRkhj9wU589E3CvXjb6yJp+P4mHnZEmNiYK?= =?us-ascii?Q?l/YdTKce9eHNUWavk92fE6NvGZCn5iPUmcwz0jLPRrTjrD4vCV+9JRuQhoDQ?= =?us-ascii?Q?vbD/OsXpgbe+n9k4fjdaIECaMrrlen3DtZemMzL41SXfJjzybNdUJ2rnnU6M?= =?us-ascii?Q?gu34UVzGWNf0LkfP8m7+2d5jeVgk1sBNmg9tCo+BdnDDfxiY1MUh0SsR0FWq?= =?us-ascii?Q?3uPRgIuXn8QeGicwSBvC7htgeNyz/0w/grpeNSTfGzNxdPP+MeK6hd0B+sUw?= =?us-ascii?Q?Ku87ZvWhYyqh40NZa+mv2zLcaHpDxskmrgt9aRYQge15FjmT0r8MuNbPHr4v?= =?us-ascii?Q?y4CizfpGT/Qf9ETRs5mvuH3TjRUpa9bjOcuduRRquNHJcU0dah5U/pPjMN6G?= =?us-ascii?Q?y6SNbuUt/8WU3akPrKXVXe5JJ3XXofMdb12AQ4rxwkacv3FLfVAiygGJFHBR?= =?us-ascii?Q?Ex5rnXGGB7TjgD692oXIsJlijTDQGonY29JwLnAn749oyuJBSjAPf3dJsRcI?= =?us-ascii?Q?XJdpkS8j8suMccHiOS8rrRmsuqxXaqXQR3Ty6laGOAe2IU3l5YhtXQfmz5bN?= =?us-ascii?Q?A4YntRzH+4b7k/ZPRQGAl18ItsISFtSxI9g5cLbExjatEpm8Zxrbbfb+MxyN?= =?us-ascii?Q?+rcxe4v9Zs8uP4G/6XizCQYJgpZXETXyBnvRocbsAwIEKmtILaOdmRCiIUzD?= =?us-ascii?Q?8H7TkOs6pHx1865jJ8iZ5+025l/I0lkOAZC4koMeHfIileCNAUPY2VRFGq9J?= =?us-ascii?Q?pdx0pWv/CVeGeN6NK1HCX4dHE5IuneM+l4Rv9buGZTSLuhH41Nqm/Pr0ZDL4?= =?us-ascii?Q?9UYydVyr7SiUMVyAjviilN7WENDA+4SoigOwPMzW1OWggNQhwsE4UMhiYR7V?= =?us-ascii?Q?o2fXIozl5VW10I0H1OSB6NQ4YBbAE5/I3V7XVOHhFINzGllOqi0MrGOgfnMw?= 
=?us-ascii?Q?q9TiLheH+L77mmJaeOq0Qte46ix4GNjBDW/v+njsl7vF4v5g0qxUniPDcxZD?= =?us-ascii?Q?ESX7qBg5GHSAVpV8uhVJqbq76hwQ+IOg91lWh+z0AAUCifQsGlPcKJxzV/O2?= =?us-ascii?Q?JCiPHE36A0IVBNyrRsqp8OqnR+mg0yS0HSgTskeSQVYw8Ga7mNdJL1jnV7HU?= =?us-ascii?Q?4dAVb57hW3n+tHJ6XY31skEU/TixN4cZGCc0EXCkmvVZklFjU2hcJIGgnJQk?= =?us-ascii?Q?6x94qonikCecSidk78UmtFym+/XWTrJeqQ8Vigv5pdJsfAOD+OGpA0qFXITA?= =?us-ascii?Q?jJlUXbZuCxlkzcFNOox7eVBkV6FdkRMwoI2RRFYJ9f1qBxoM6zDk7ZhRVKNB?= =?us-ascii?Q?XbPvLDN9W7SSy/0kuRzsr8yrFsnktaUUG+ecS5jF2s16ilvewas1ur9utngA?= =?us-ascii?Q?JDe6uw=3D=3D?= X-Forefront-Antispam-Report: CIP:165.204.84.17;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:satlexmb07.amd.com;PTR:InfoDomainNonexistent;CAT:NONE;SFS:(13230040)(7416014)(82310400026)(376014)(1800799024)(36860700013);DIR:OUT;SFP:1101; X-OriginatorOrg: amd.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 29 Jan 2026 14:45:29.9321 (UTC) X-MS-Exchange-CrossTenant-Network-Message-Id: a61017e7-f960-49fb-d3e5-08de5f450c85 X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[satlexmb07.amd.com] X-MS-Exchange-CrossTenant-AuthSource: SJ5PEPF000001C9.namprd05.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem X-MS-Exchange-Transport-CrossTenantHeadersStamped: CH2PR12MB4136 Content-Type: text/plain; charset="utf-8" From: Kinsey Ho Introduce a new kernel daemon, klruscand, that periodically invokes the MGLRU page table walk. It leverages the new callbacks to gather access information and forwards it to pghot sub-system for promotion decisions. This benefits from reusing the existing MGLRU page table walk infrastructure, which is optimized with features such as hierarchical scanning and bloom filters to reduce CPU overhead. 
As an additional optimization to be added in the future, we can tune
the scan intervals for each memcg.

Signed-off-by: Kinsey Ho
Signed-off-by: Yuanchu Xie
[Reduced the scan interval to 500ms, KLRUSCAND to default n in config]
Signed-off-by: Bharata B Rao
---
 mm/Kconfig     |   8 ++++
 mm/Makefile    |   1 +
 mm/klruscand.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+)
 create mode 100644 mm/klruscand.c

diff --git a/mm/Kconfig b/mm/Kconfig
index 07b16aece877..9e9eca8db8bf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1502,6 +1502,14 @@ config HWMEM_PROFILER
 	  rolled up to PGHOT sub-system for further action like hot page
 	  promotion or NUMA Balancing
 
+config KLRUSCAND
+	bool "Kernel lower tier access scan daemon"
+	default n
+	depends on PGHOT && LRU_GEN_WALKS_MMU
+	help
+	  Scan for accesses from lower tiers by invoking MGLRU to perform
+	  page table walks.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 89f999647752..c68df497a063 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -153,3 +153,4 @@ obj-$(CONFIG_PGHOT) += pghot-precise.o
 else
 obj-$(CONFIG_PGHOT) += pghot-default.o
 endif
+obj-$(CONFIG_KLRUSCAND) += klruscand.o
diff --git a/mm/klruscand.c b/mm/klruscand.c
new file mode 100644
index 000000000000..13a41b38d67d
--- /dev/null
+++ b/mm/klruscand.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+#define KLRUSCAND_INTERVAL 500
+#define BATCH_SIZE (2 << 16)
+
+static struct task_struct *scan_thread;
+static unsigned long pfn_batch[BATCH_SIZE];
+static int batch_index;
+
+static void flush_cb(void)
+{
+	int i;
+
+	for (i = 0; i < batch_index; i++) {
+		unsigned long pfn = pfn_batch[i];
+
+		pghot_record_access(pfn, NUMA_NO_NODE, PGHOT_PGTABLE_SCAN, jiffies);
+
+		if (i % 16 == 0)
+			cond_resched();
+	}
+	batch_index = 0;
+}
+
+static bool accessed_cb(unsigned long pfn)
+{
+	WARN_ON_ONCE(batch_index == BATCH_SIZE);
+
+	if (batch_index < BATCH_SIZE)
+		pfn_batch[batch_index++] = pfn;
+
+	return batch_index == BATCH_SIZE;
+}
+
+static int klruscand_run(void *unused)
+{
+	struct lru_gen_mm_walk *walk;
+
+	walk = kzalloc(sizeof(*walk),
+		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	if (!walk)
+		return -ENOMEM;
+
+	while (!kthread_should_stop()) {
+		unsigned long next_wake_time;
+		long sleep_time;
+		struct mem_cgroup *memcg;
+		int flags;
+		int nid;
+
+		next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL);
+
+		for_each_node_state(nid, N_MEMORY) {
+			pg_data_t *pgdat = NODE_DATA(nid);
+			struct reclaim_state rs = { 0 };
+
+			if (node_is_toptier(nid))
+				continue;
+
+			rs.mm_walk = walk;
+			set_task_reclaim_state(current, &rs);
+			flags = memalloc_noreclaim_save();
+
+			memcg = mem_cgroup_iter(NULL, NULL, NULL);
+			do {
+				struct lruvec *lruvec =
+					mem_cgroup_lruvec(memcg, pgdat);
+				unsigned long max_seq =
+					READ_ONCE((lruvec)->lrugen.max_seq);
+
+				lru_gen_scan_lruvec(lruvec, max_seq, accessed_cb, flush_cb);
+				cond_resched();
+			} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+			memalloc_noreclaim_restore(flags);
+			set_task_reclaim_state(current, NULL);
+			memset(walk, 0, sizeof(*walk));
+		}
+
+		sleep_time = next_wake_time - jiffies;
+		if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
+			schedule_timeout_idle(sleep_time);
+	}
+	kfree(walk);
+	return 0;
+}
+
+static int __init klruscand_init(void)
+{
+	struct task_struct *task;
+
+	task = kthread_run(klruscand_run, NULL, "klruscand");
+
+	if (IS_ERR(task)) {
+		pr_err("Failed to create klruscand kthread\n");
+		return PTR_ERR(task);
+	}
+
+	scan_thread = task;
+	return 0;
+}
+module_init(klruscand_init);
-- 
2.34.1

From nobody Sat Feb 7 08:44:11 2026
From: Bharata B Rao
Subject: [RFC PATCH v5 10/10] mm: pghot: Add folio_mark_accessed() as hotness source
Date: Thu, 29 Jan 2026 20:10:43 +0530
Message-ID: <20260129144043.231636-11-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260129144043.231636-1-bharata@amd.com>
References: <20260129144043.231636-1-bharata@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Unmapped page cache pages that end up in lower tiers don't get
promoted easily. There were attempts to identify such pages and
get them promoted as part of NUMA Balancing earlier [1]. The
same idea is taken forward here by using folio_mark_accessed()
as a source of hotness.

Lower tier accesses from folio_mark_accessed() are reported to
pghot sub-system for hotness tracking and subsequent promotion.

TODO: Need a better naming for this hotness source. Need to
better understand/evaluate the overhead of hotness info
collection from this path.

[1] https://lore.kernel.org/linux-mm/20250411221111.493193-1-gourry@gourry.net/

Signed-off-by: Bharata B Rao
---
 Documentation/admin-guide/mm/pghot.txt | 7 ++++++-
 include/linux/pghot.h                  | 5 +++++
 include/linux/vm_event_item.h          | 1 +
 mm/pghot-tunables.c                    | 7 +++++++
 mm/pghot.c                             | 6 ++++++
 mm/swap.c                              | 8 ++++++++
 mm/vmstat.c                            | 1 +
 7 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/pghot.txt b/Documentation/admin-guide/mm/pghot.txt
index b329e692ef89..c8eb61064247 100644
--- a/Documentation/admin-guide/mm/pghot.txt
+++ b/Documentation/admin-guide/mm/pghot.txt
@@ -23,9 +23,10 @@ Path: /sys/kernel/debug/pghot/
      - 0: Hardware hints (value 0x1)
      - 1: Page table scan (value 0x2)
      - 2: Hint faults (value 0x4)
+     - 3: folio_mark_accessed (value 0x8)
    - Default: 0 (disabled)
    - Example:
-     # echo 0x7 > /sys/kernel/debug/pghot/enabled_sources
+     # echo 0xf > /sys/kernel/debug/pghot/enabled_sources
      Enables all sources.
 
 2. **target_nid**
@@ -82,3 +83,7 @@ Path: /proc/vmstat
 4. **pghot_recorded_hintfaults**
    - Number of recorded accesses reported by NUMA Balancing based
      hotness source.
+
+5. **pghot_recorded_fma**
+   - Number of recorded accesses reported by folio_mark_accessed()
+     hotness source.
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 603791183102..8cf9dfb5365a 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -19,6 +19,7 @@ enum pghot_src {
 	PGHOT_HW_HINTS,
 	PGHOT_PGTABLE_SCAN,
 	PGHOT_HINT_FAULT,
+	PGHOT_FMA,
 };
 
 #ifdef CONFIG_PGHOT
@@ -36,6 +37,7 @@ void pghot_debug_init(void);
 DECLARE_STATIC_KEY_FALSE(pghot_src_hwhints);
 DECLARE_STATIC_KEY_FALSE(pghot_src_pgtscans);
 DECLARE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+DECLARE_STATIC_KEY_FALSE(pghot_src_fma);
 
 /*
  * Bit positions to enable individual sources in pghot/records_enabled
@@ -45,6 +47,7 @@ enum pghot_src_enabled {
 	PGHOT_HWHINTS_BIT = 0,
 	PGHOT_PGTSCAN_BIT,
 	PGHOT_HINTFAULT_BIT,
+	PGHOT_FMA_BIT,
 	PGHOT_MAX_BIT
 };
 
@@ -52,6 +55,8 @@ enum pghot_src_enabled {
 #define PGHOT_PGTSCAN_ENABLED	BIT(PGHOT_PGTSCAN_BIT)
 #define PGHOT_HINTFAULT_ENABLED	BIT(PGHOT_HINTFAULT_BIT)
 #define PGHOT_SRC_ENABLED_MASK	GENMASK(PGHOT_MAX_BIT - 1, 0)
+#define PGHOT_FMA_ENABLED	BIT(PGHOT_FMA_BIT)
+#define PGHOT_SRC_ENABLED_MASK	GENMASK(PGHOT_MAX_BIT - 1, 0)
 
 #define PGHOT_DEFAULT_FREQ_THRESHOLD	2
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 67efbca9051c..ac1f28646b9c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -193,6 +193,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGHOT_RECORD_HWHINTS,
 		PGHOT_RECORD_PGTSCANS,
 		PGHOT_RECORD_HINTFAULTS,
+		PGHOT_RECORD_FMA,
 #ifdef CONFIG_HWMEM_PROFILER
 		HWHINT_NR_EVENTS,
 		HWHINT_KERNEL,
diff --git a/mm/pghot-tunables.c b/mm/pghot-tunables.c
index 79afbcb1e4f0..11c7f742a1be 100644
--- a/mm/pghot-tunables.c
+++ b/mm/pghot-tunables.c
@@ -124,6 +124,13 @@ static void pghot_src_enabled_update(unsigned int enabled)
 		else
 			static_branch_disable(&pghot_src_hintfaults);
 	}
+
+	if (changed & PGHOT_FMA_ENABLED) {
+		if (enabled & PGHOT_FMA_ENABLED)
+			static_branch_enable(&pghot_src_fma);
+		else
+			static_branch_disable(&pghot_src_fma);
+	}
 }
 
 static ssize_t pghot_src_enabled_write(struct file *filp, const char __user *ubuf,
diff --git a/mm/pghot.c b/mm/pghot.c
index 6fc76c1eaff8..537f4af816ff 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -43,6 +43,7 @@ static unsigned int sysctl_pghot_promote_rate_limit = 65536;
 DEFINE_STATIC_KEY_FALSE(pghot_src_hwhints);
 DEFINE_STATIC_KEY_FALSE(pghot_src_pgtscans);
 DEFINE_STATIC_KEY_FALSE(pghot_src_hintfaults);
+DEFINE_STATIC_KEY_FALSE(pghot_src_fma);
 
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table pghot_sysctls[] = {
@@ -113,6 +114,11 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
 			return -EINVAL;
 		count_vm_event(PGHOT_RECORD_HINTFAULTS);
 		break;
+	case PGHOT_FMA:
+		if (!static_branch_likely(&pghot_src_fma))
+			return -EINVAL;
+		count_vm_event(PGHOT_RECORD_FMA);
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index 2260dcd2775e..31a654b19844 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,8 @@
 #include
 #include
 #include
+#include
+#include
 
 #include "internal.h"
 
@@ -454,8 +456,14 @@ static bool lru_gen_clear_refs(struct folio *folio)
  */
 void folio_mark_accessed(struct folio *folio)
 {
+	unsigned long pfn = folio_pfn(folio);
+
 	if (folio_test_dropbehind(folio))
 		return;
+
+	if (!node_is_toptier(pfn_to_nid(pfn)))
+		pghot_record_access(pfn, NUMA_NO_NODE, PGHOT_FMA, jiffies);
+
 	if (lru_gen_enabled()) {
 		lru_gen_inc_refs(folio);
 		return;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 62c47f44edf0..c4d90baf440b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1506,6 +1506,7 @@ const char * const vmstat_text[] = {
 	[I(PGHOT_RECORD_HWHINTS)]		= "pghot_recorded_hwhints",
 	[I(PGHOT_RECORD_PGTSCANS)]		= "pghot_recorded_pgtscans",
 	[I(PGHOT_RECORD_HINTFAULTS)]		= "pghot_recorded_hintfaults",
+	[I(PGHOT_RECORD_FMA)]			= "pghot_recorded_fma",
 #ifdef CONFIG_HWMEM_PROFILER
 	[I(HWHINT_NR_EVENTS)]			= "hwhint_nr_events",
 	[I(HWHINT_KERNEL)]			= "hwhint_kernel",
-- 
2.34.1
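
Reader's note, not part of the posted patch: the enabled_sources values in the pghot.txt hunk (0x1, 0x2, 0x4, 0x8, with 0xf enabling everything) follow directly from the bit positions in enum pghot_src_enabled. The user-space sketch below mirrors that arithmetic. BIT() and GENMASK() here are simplified 32-bit stand-ins for the kernel macros, PGHOT_HWHINTS_ENABLED is assumed by analogy with the other *_ENABLED defines, and the two helper functions are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

/* Simplified 32-bit stand-ins for the kernel's BIT() and GENMASK() (assumption). */
#define BIT(n)		(1U << (n))
#define GENMASK(h, l)	((~0U << (l)) & (~0U >> (31 - (h))))

/* Bit positions as listed in the include/linux/pghot.h hunk of this patch. */
enum pghot_src_enabled {
	PGHOT_HWHINTS_BIT = 0,
	PGHOT_PGTSCAN_BIT,
	PGHOT_HINTFAULT_BIT,
	PGHOT_FMA_BIT,
	PGHOT_MAX_BIT
};

#define PGHOT_HWHINTS_ENABLED	BIT(PGHOT_HWHINTS_BIT)		/* 0x1, assumed by analogy */
#define PGHOT_PGTSCAN_ENABLED	BIT(PGHOT_PGTSCAN_BIT)		/* 0x2 */
#define PGHOT_HINTFAULT_ENABLED	BIT(PGHOT_HINTFAULT_BIT)	/* 0x4 */
#define PGHOT_FMA_ENABLED	BIT(PGHOT_FMA_BIT)		/* 0x8 */
#define PGHOT_SRC_ENABLED_MASK	GENMASK(PGHOT_MAX_BIT - 1, 0)	/* 0xf */

/* Hypothetical helper: true if 'val' sets only bits that name a known source. */
static bool pghot_sources_valid(unsigned int val)
{
	return (val & ~PGHOT_SRC_ENABLED_MASK) == 0;
}

/* Hypothetical helper: true if the source at bit position 'bit' is set in 'val'. */
static bool pghot_source_on(unsigned int val, int bit)
{
	return (val & BIT(bit)) != 0;
}
```

With these definitions, the old documented example value 0x7 leaves the folio_mark_accessed() source off, while 0xf turns on all four sources; any value with a bit above position 3 falls outside PGHOT_SRC_ENABLED_MASK.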