From nobody Wed Sep 10 23:31:03 2025
From: Bharata B Rao
Subject: [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too
Date: Wed, 10 Sep 2025 20:16:46 +0530
Message-ID: <20250910144653.212066-2-bharata@amd.com>
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

We want isolation of misplaced folios to work in contexts
where VMA isn't available. In order to prepare for that
allow migrate_misplaced_folio_prepare() to be called with
a NULL VMA.

When migrate_misplaced_folio_prepare() is called with non-NULL
VMA, it will check if the folio is mapped shared and that requires
holding PTL lock. This path isn't taken when the function is
invoked with NULL VMA (migration outside of process context).
Hence for such cases, it is not necessary this function be
called with PTL lock held.

Signed-off-by: Bharata B Rao
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 425401b2d4e1..7e356c0b1b5a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2619,7 +2619,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 
 /*
  * Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
  */
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -2636,7 +2637,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		 * See folio_maybe_mapped_shared() on possible imprecision
 		 * when we cannot easily detect if a folio is shared.
 		 */
-		if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+		if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
 			return -EACCES;
 
 		/*
-- 
2.34.1

From nobody Wed Sep 10 23:31:03 2025
From: Bharata B Rao
Subject: [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch
Date: Wed, 10 Sep 2025 20:16:47 +0530
Message-ID: <20250910144653.212066-3-bharata@amd.com>
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Gregory Price

A common operation in tiering is to migrate multiple pages at once.
The migrate_misplaced_folio function requires one call for each
individual folio. Expose a batch-variant of the same call for use
when doing batch migrations.

Signed-off-by: Gregory Price
Signed-off-by: Bharata B Rao
---
 include/linux/migrate.h |  6 ++++++
 mm/migrate.c            | 31 +++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index acadd41e0b5c..0593f5869be8 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -107,6 +107,7 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folios_batch(struct list_head *foliolist, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -117,6 +118,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folios_batch(struct list_head *foliolist,
+						 int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 7e356c0b1b5a..1268a95eda0e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2714,5 +2714,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/*
+ * Batch variant of migrate_misplaced_folio. Attempts to migrate
+ * a folio list to the specified destination.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, dereference them, and
+ * remove them from the list before returning.
+ */
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	unsigned int nr_succeeded;
+	int nr_remaining;
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+	BUG_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.34.1

From nobody Wed Sep 10 23:31:03 2025
From: Bharata B Rao
Subject: [RFC PATCH v2 3/8] mm: Hot page tracking and promotion
Date: Wed, 10 Sep 2025 20:16:48 +0530
Message-ID: <20250910144653.212066-4-bharata@amd.com>
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

This introduces a sub-system for collecting memory access
information from different sources. It maintains the hotness
information based on the access history and time of access.
Additionally, it provides per-lowertier-node kernel threads
(named kpromoted) that periodically promote the pages that
are eligible for promotion.

Sub-systems that generate hot page access info can report that
using this API:

int pghot_record_access(u64 pfn, int nid, int src,
			unsigned long time)

@pfn: The PFN of the memory accessed
@nid: The accessing NUMA node ID
@src: The temperature source (sub-system) that generated the
      access info
@time: The access time in jiffies

Some temperature sources may not provide the nid from which
the page was accessed. This is true for sources that use
page table scanning for PTE Accessed bit. For such sources,
the default toptier node to which such pages should be promoted
is hard coded.

Also, the access time provided by some sources may at best be
considered approximate. This is especially true for hot pages
detected by PTE A bit scanning.

The hot PFN records are stored in hash lists hashed by PFN value.
The PFN records that are categorized as hot enough to be promoted
are maintained in a per-lowertier-node max heap from which
kpromoted extracts and promotes them.

Signed-off-by: Bharata B Rao
---
 include/linux/mmzone.h        |  11 +
 include/linux/pghot.h         |  96 +++++++
 include/linux/vm_event_item.h |   9 +
 mm/Kconfig                    |  11 +
 mm/Makefile                   |   1 +
 mm/mm_init.c                  |  10 +
 mm/pghot.c                    | 524 ++++++++++++++++++++++++++++++++++
 mm/vmstat.c                   |   9 +
 8 files changed, 671 insertions(+)
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/pghot.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c5da9141983..f7094babed10 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1349,6 +1349,10 @@ struct memory_failure_stats {
 };
 #endif
 
+#ifdef CONFIG_PGHOT
+#include
+#endif
+
 /*
  * On NUMA machines, each NUMA node would have a pg_data_t to describe
  * it's memory layout. On UMA machines there is a single pglist_data which
@@ -1497,6 +1501,13 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+#ifdef CONFIG_PGHOT
+	struct task_struct *kpromoted;
+	wait_queue_head_t kpromoted_wait;
+	struct pghot_info **phi_buf;
+	struct max_heap heap;
+	spinlock_t heap_lock;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..1443643aab13
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KPROMOTED_H
+#define _LINUX_KPROMOTED_H
+
+#include
+#include
+#include
+
+/* Page hotness temperature sources */
+enum pghot_src {
+	PGHOT_HW_HINTS,
+	PGHOT_PGTABLE_SCAN,
+	PGHOT_HINT_FAULT,
+};
+
+#ifdef CONFIG_PGHOT
+
+#define KPROMOTED_FREQ_WINDOW	(5 * MSEC_PER_SEC)
+
+/* 2 accesses within a window will make the page a promotion candidate */
+#define KPROMOTED_FREQ_THRESHOLD	2
+
+#define PGHOT_FREQ_BITS		3
+#define PGHOT_NID_BITS		10
+#define PGHOT_TIME_BITS		19
+
+#define PGHOT_FREQ_MAX		(1 << PGHOT_FREQ_BITS)
+#define PGHOT_NID_MAX		(1 << PGHOT_NID_BITS)
+
+/*
+ * last_update is stored in 19 bits which can represent up to
+ * about 524s (8.73 minutes) with HZ=1000
+ */
+#define PGHOT_TIME_MASK		GENMASK_U32(PGHOT_TIME_BITS - 1, 0)
+
+/*
+ * The following two defines control the number of hash lists
+ * that are maintained for tracking PFN accesses.
+ */
+#define PGHOT_HASH_PCT		50	/* % of lower tier memory pages to track */
+#define PGHOT_HASH_ENTRIES	1024	/* Number of entries per list, ideal case */
+
+/*
+ * Percentage of hash entries that can reside in heap as migrate-ready
+ * candidates
+ */
+#define PGHOT_HEAP_PCT		25
+
+#define KPROMOTED_MIGRATE_BATCH	1024
+
+/*
+ * If target NID isn't available, kpromoted promotes to node 0
+ * by default.
+ * + * TODO: Need checks to validate that default node is indeed + * present and is a toptier node. + */ +#define KPROMOTED_DEFAULT_NODE 0 + +struct pghot_info { + unsigned long pfn; + + /* + * The following three fundamental parameters + * required to track the hotness of page/PFN are + * packed within a single u32. + */ + u32 frequency:PGHOT_FREQ_BITS; /* Number of accesses within current windo= w */ + u32 nid:PGHOT_NID_BITS; /* Most recent access from this node */ + u32 last_update:PGHOT_TIME_BITS; /* Most recent access time */ + + struct hlist_node hnode; + size_t heap_idx; /* Position in max heap for quick retreival */ +}; + +struct max_heap { + size_t nr; + size_t size; + struct pghot_info **data; + DECLARE_FLEX_ARRAY(struct pghot_info *, preallocated); +}; + +/* + * The wakeup interval of kpromoted threads + */ +#define KPROMOTE_DELAY 20 /* 20ms */ + +int pghot_record_access(u64 pfn, int nid, int src, unsigned long now); +#else +static inline int pghot_record_access(u64 pfn, int nid, int src, + unsigned long now) +{ + return 0; +} +#endif /* CONFIG_PGHOT */ +#endif /* _LINUX_KPROMOTED_H */ diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 9e15a088ba38..a996fa9df785 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -186,6 +186,15 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, KSTACK_REST, #endif #endif /* CONFIG_DEBUG_STACK_USAGE */ + PGHOT_RECORDED_ACCESSES, + PGHOT_RECORD_HWHINTS, + PGHOT_RECORD_PGTSCANS, + PGHOT_RECORD_HINTFAULTS, + PGHOT_RECORDS_HASH, + PGHOT_RECORDS_HEAP, + KPROMOTED_RIGHT_NODE, + KPROMOTED_NON_LRU, + KPROMOTED_DROPPED, NR_VM_EVENT_ITEMS }; =20 diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..8b236eb874cf 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1381,6 +1381,17 @@ config PT_RECLAIM =20 Note: now only empty user PTE page table pages will be reclaimed. 
 
+config PGHOT
+	bool "Hot page tracking and promotion"
+	def_bool y
+	depends on NUMA && MIGRATION && MMU
+	select MIN_HEAP
+	help
+	  A sub-system to track page accesses in lower tier memory and
+	  maintain hot page information. Promotes hot pages from lower
+	  tiers to top tier by using the memory access information provided
+	  by various sources. Asynchronous promotion is done by per-node
+	  kernel threads.
 
 source "mm/damon/Kconfig"
 
diff --git a/mm/Makefile b/mm/Makefile
index ef54aa615d9d..ecdd5241bea8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
+obj-$(CONFIG_PGHOT) += pghot.o
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5c21b3af216b..f7992be3ff7f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1402,6 +1402,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 #endif
 
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kpromoted(struct pglist_data *pgdat)
+{
+	init_waitqueue_head(&pgdat->kpromoted_wait);
+}
+#else
+static void pgdat_init_kpromoted(struct pglist_data *pgdat) {}
+#endif
+
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
 	int i;
@@ -1411,6 +1420,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
+	pgdat_init_kpromoted(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/pghot.c b/mm/pghot.c
new file mode 100644
index 000000000000..9f7581818b8f
--- /dev/null
+++ b/mm/pghot.c
@@ -0,0 +1,524 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Maintains information about hot pages from slower tier nodes and
+ * promotes them.
+ *
+ * Info about accessed pages are stored in hash lists indexed by PFN.
+ * Info about pages that are hot enough to be promoted are stored in
+ * a per-toptier-node max_heap.
+ *
+ * kpromoted is a kernel thread that runs on each toptier node and
+ * promotes pages from max_heap.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct pghot_hash {
+	struct hlist_head hash;
+	spinlock_t lock;
+};
+
+static struct pghot_hash *phi_hash;
+static int phi_hash_order;
+static int phi_heap_entries;
+static struct kmem_cache *phi_cache __ro_after_init;
+static bool kpromoted_started __ro_after_init;
+
+static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
+
+#ifdef CONFIG_SYSCTL
+static const struct ctl_table pghot_sysctls[] = {
+	{
+		.procname	= "pghot_promote_freq_window_ms",
+		.data		= &sysctl_pghot_freq_window,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
+};
+#endif
+static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
+{
+	return (*(struct pghot_info **)lhs)->frequency >
+		(*(struct pghot_info **)rhs)->frequency;
+}
+
+static void phi_heap_swp(void *lhs, void *rhs, void *args)
+{
+	struct pghot_info **l = (struct pghot_info **)lhs;
+	struct pghot_info **r = (struct pghot_info **)rhs;
+	int lindex = l - (struct pghot_info **)args;
+	int rindex = r - (struct pghot_info **)args;
+	struct pghot_info *tmp = *l;
+
+	*l = *r;
+	*r = tmp;
+
+	(*l)->heap_idx = lindex;
+	(*r)->heap_idx = rindex;
+}
+
+static const struct min_heap_callbacks phi_heap_cb = {
+	.less = phi_heap_less,
+	.swp = phi_heap_swp,
+};
+
+static void phi_heap_update_entry(struct max_heap *phi_heap, struct pghot_info *phi)
+{
+	int orig_idx = phi->heap_idx;
+
+	min_heap_sift_up(phi_heap, phi->heap_idx, &phi_heap_cb,
+			 phi_heap->data);
+	if (phi_heap->data[phi->heap_idx]->heap_idx == orig_idx)
+		min_heap_sift_down(phi_heap, phi->heap_idx,
+				   &phi_heap_cb, phi_heap->data);
+}
+
+static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
+{
+	if (phi_heap->nr >= phi_heap_entries)
+		return false;
+
+	phi->heap_idx = phi_heap->nr;
+	min_heap_push(phi_heap, &phi, &phi_heap_cb, phi_heap->data);
+
+	return true;
+}
+
+static bool phi_is_pfn_hot(struct pghot_info *phi)
+{
+	struct page *page = pfn_to_online_page(phi->pfn);
+	unsigned long now = jiffies;
+	struct folio *folio;
+
+	if (!page || is_zone_device_page(page))
+		return false;
+
+	folio = page_folio(page);
+	if (!folio_test_lru(folio)) {
+		count_vm_event(KPROMOTED_NON_LRU);
+		return false;
+	}
+	if (folio_nid(folio) == phi->nid) {
+		count_vm_event(KPROMOTED_RIGHT_NODE);
+		return false;
+	}
+
+	return true;
+}
+
+static struct folio *kpromoted_isolate_folio(struct pghot_info *phi)
+{
+	struct page *page = pfn_to_page(phi->pfn);
+	struct folio *folio;
+
+	if (!page)
+		return NULL;
+
+	folio = page_folio(page);
+	if (migrate_misplaced_folio_prepare(folio, NULL, phi->nid))
+		return NULL;
+	else
+		return folio;
+}
+
+static struct pghot_info *phi_alloc(unsigned long pfn)
+{
+	struct pghot_info *phi;
+
+	phi = kmem_cache_zalloc(phi_cache, GFP_NOWAIT);
+	if (!phi)
+		return NULL;
+
+	phi->pfn = pfn;
+	phi->heap_idx = -1;
+	return phi;
+}
+
+static inline void phi_free(struct pghot_info *phi)
+{
+	kmem_cache_free(phi_cache, phi);
+}
+
+static int phi_heap_extract(pg_data_t *pgdat, int batch_count, int freq_th,
+			    struct list_head *migrate_list, int *count)
+{
+	spinlock_t *phi_heap_lock = &pgdat->heap_lock;
+	struct max_heap *phi_heap = &pgdat->heap;
+	int max_retries = 10;
+	int bkt, i = 0;
+
+	if (batch_count < 0 || !migrate_list || !count || freq_th < 1 ||
+	    freq_th > KPROMOTED_FREQ_THRESHOLD)
+		return -EINVAL;
+
+	*count = 0;
+	for (i = 0; i < batch_count; i++) {
+		struct pghot_info *top = NULL;
+		bool should_continue = false;
+		struct folio *folio;
+		int retries = 0;
+
+		while (retries < max_retries) {
+			spin_lock(phi_heap_lock);
+			if (phi_heap->nr > 0 && phi_heap->data[0]->frequency >= freq_th) {
+				should_continue = true;
+				bkt = hash_min(phi_heap->data[0]->pfn, phi_hash_order);
+				top = phi_heap->data[0];
+			}
+			spin_unlock(phi_heap_lock);
+
+			if (!should_continue)
+				goto done;
+
+			spin_lock(&phi_hash[bkt].lock);
+			spin_lock(phi_heap_lock);
+			if (phi_heap->nr == 0 || phi_heap->data[0] != top ||
+			    phi_heap->data[0]->frequency < freq_th) {
+				spin_unlock(phi_heap_lock);
+				spin_unlock(&phi_hash[bkt].lock);
+				retries++;
+				continue;
+			}
+
+			top = phi_heap->data[0];
+			hlist_del_init(&top->hnode);
+
+			phi_heap->nr--;
+			if (phi_heap->nr > 0) {
+				phi_heap->data[0] = phi_heap->data[phi_heap->nr];
+				phi_heap->data[0]->heap_idx = 0;
+				min_heap_sift_down(phi_heap, 0, &phi_heap_cb,
+						   phi_heap->data);
+			}
+
+			spin_unlock(phi_heap_lock);
+			spin_unlock(&phi_hash[bkt].lock);
+
+			if (!phi_is_pfn_hot(top)) {
+				count_vm_event(KPROMOTED_DROPPED);
+				goto skip;
+			}
+
+			folio = kpromoted_isolate_folio(top);
+			if (folio) {
+				list_add(&folio->lru, migrate_list);
+				(*count)++;
+			}
+skip:
+			phi_free(top);
+			break;
+		}
+		if (retries >= max_retries) {
+			pr_warn("%s: Too many retries\n", __func__);
+			break;
+		}
+
+	}
+done:
+	return 0;
+}
+
+static void phi_heap_add_or_adjust(struct pghot_info *phi)
+{
+	pg_data_t *pgdat = NODE_DATA(phi->nid);
+	struct max_heap *phi_heap = &pgdat->heap;
+
+	spin_lock(&pgdat->heap_lock);
+	if (phi->heap_idx >= 0 && phi->heap_idx < phi_heap->nr &&
+	    phi_heap->data[phi->heap_idx] == phi) {
+		/* Entry exists in heap */
+		if (phi->frequency < KPROMOTED_FREQ_THRESHOLD) {
+			/* Below threshold, remove from the heap */
+			phi_heap->nr--;
+			if (phi->heap_idx < phi_heap->nr) {
+				phi_heap->data[phi->heap_idx] =
+					phi_heap->data[phi_heap->nr];
+				phi_heap->data[phi->heap_idx]->heap_idx =
+					phi->heap_idx;
+				min_heap_sift_down(phi_heap, phi->heap_idx,
+						   &phi_heap_cb, phi_heap->data);
+			}
+			phi->heap_idx = -1;
+
+		} else {
+			/* Update position in heap */
+			phi_heap_update_entry(phi_heap, phi);
+		}
+	} else if (phi->frequency >= KPROMOTED_FREQ_THRESHOLD) {
+		/*
+		 * Add to the heap. If heap is full we will have
+		 * to wait for the next access reporting to elevate
+		 * it to heap.
+		 */
+		if (phi_heap_insert(phi_heap, phi))
+			count_vm_event(PGHOT_RECORDS_HEAP);
+	}
+	spin_unlock(&pgdat->heap_lock);
+}
+
+static struct pghot_info *phi_lookup(unsigned long pfn, int bkt)
+{
+	struct pghot_info *phi;
+
+	hlist_for_each_entry(phi, &phi_hash[bkt].hash, hnode) {
+		if (phi->pfn == pfn)
+			return phi;
+	}
+	return NULL;
+}
+
+/*
+ * Called by subsystems that generate page hotness/access information.
+ *
+ * @pfn: The PFN of the memory accessed
+ * @nid: The accessing NUMA node ID
+ * @src: The temperature source (sub-system) that generated the
+ *       access info
+ * @time: The access time in jiffies
+ *
+ * Maintains the access records per PFN, classifies them as
+ * hot based on subsequent accesses and finally hands over
+ * them to kpromoted for migration.
+ */
+int pghot_record_access(u64 pfn, int nid, int src, unsigned long now)
+{
+	struct pghot_info *phi;
+	struct page *page;
+	struct folio *folio;
+	int bkt;
+	bool new_entry = false, new_window = false;
+	u32 cur_time = now & PGHOT_TIME_MASK;
+
+	if (!kpromoted_started)
+		return -EINVAL;
+
+	if (nid >= PGHOT_NID_MAX)
+		return -EINVAL;
+
+	count_vm_event(PGHOT_RECORDED_ACCESSES);
+
+	switch (src) {
+	case PGHOT_HW_HINTS:
+		count_vm_event(PGHOT_RECORD_HWHINTS);
+		break;
+	case PGHOT_PGTABLE_SCAN:
+		count_vm_event(PGHOT_RECORD_PGTSCANS);
+		break;
+	case PGHOT_HINT_FAULT:
+		count_vm_event(PGHOT_RECORD_HINTFAULTS);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/*
+	 * Record only accesses from lower tiers.
+	 */
+	if (node_is_toptier(pfn_to_nid(pfn)))
+		return 0;
+
+	/*
+	 * Reject the non-migratable pages right away.
+	 */
+	page = pfn_to_online_page(pfn);
+	if (!page || is_zone_device_page(page))
+		return 0;
+
+	folio = page_folio(page);
+	if (!folio_test_lru(folio))
+		return 0;
+
+	bkt = hash_min(pfn, phi_hash_order);
+	spin_lock(&phi_hash[bkt].lock);
+	phi = phi_lookup(pfn, bkt);
+	if (!phi) {
+		phi = phi_alloc(pfn);
+		if (!phi)
+			goto out;
+		new_entry = true;
+	}
+
+	/*
+	 * If the previous access was beyond the threshold window
+	 * start frequency tracking afresh.
+	 */
+	if (((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window)) ||
+	    (nid != NUMA_NO_NODE && phi->nid != nid))
+		new_window = true;
+
+	if (new_entry || new_window) {
+		/* New window */
+		phi->frequency = 1; /* TODO: Factor in the history */
+	} else if (phi->frequency < PGHOT_FREQ_MAX)
+		phi->frequency++;
+	phi->last_update = cur_time;
+	phi->nid = (nid == NUMA_NO_NODE) ? KPROMOTED_DEFAULT_NODE : nid;
+
+	if (new_entry) {
+		/* Insert the new entry into hash table */
+		hlist_add_head(&phi->hnode, &phi_hash[bkt].hash);
+		count_vm_event(PGHOT_RECORDS_HASH);
+	} else {
+		/* Add/update the position in heap */
+		phi_heap_add_or_adjust(phi);
+	}
+out:
+	spin_unlock(&phi_hash[bkt].lock);
+	return 0;
+}
+
+/*
+ * Extract the hot page records and batch-migrate the
+ * hot pages.
+ */
+static void kpromoted_migrate(pg_data_t *pgdat)
+{
+	int count, ret;
+	LIST_HEAD(migrate_list);
+
+	/*
+	 * Extract the top N elements from the heap that match
+	 * the requested hotness threshold.
+	 *
+	 * PFNs ineligible from migration standpoint are removed
+	 * from the heap and hash.
+	 *
+	 * Folios eligible for migration are isolated and returned
+	 * in @migrate_list.
+	 */
+	ret = phi_heap_extract(pgdat, KPROMOTED_MIGRATE_BATCH,
+			       KPROMOTED_FREQ_THRESHOLD, &migrate_list, &count);
+	if (ret)
+		return;
+
+	if (!list_empty(&migrate_list))
+		migrate_misplaced_folios_batch(&migrate_list, pgdat->node_id);
+}
+
+static int kpromoted(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t *)p;
+
+	while (!kthread_should_stop()) {
+		wait_event_timeout(pgdat->kpromoted_wait, false,
+				   msecs_to_jiffies(KPROMOTE_DELAY));
+		kpromoted_migrate(pgdat);
+	}
+	return 0;
+}
+
+static int kpromoted_run(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int ret = 0;
+
+	if (!node_is_toptier(nid))
+		return 0;
+
+	if (!pgdat->phi_buf) {
+		pgdat->phi_buf = vzalloc_node(phi_heap_entries * sizeof(struct pghot_info *),
+					      nid);
+		if (!pgdat->phi_buf)
+			return -ENOMEM;
+
+		min_heap_init(&pgdat->heap, pgdat->phi_buf, phi_heap_entries);
+		spin_lock_init(&pgdat->heap_lock);
+	}
+
+	if (!pgdat->kpromoted)
+		pgdat->kpromoted = kthread_create_on_node(kpromoted, pgdat, nid,
+							  "kpromoted%d", nid);
+	if (IS_ERR(pgdat->kpromoted)) {
+		ret = PTR_ERR(pgdat->kpromoted);
+		pgdat->kpromoted = NULL;
+		pr_info("Failed to start kpromoted%d, ret %d\n", nid, ret);
+	} else {
+		wake_up_process(pgdat->kpromoted);
+	}
+	return ret;
+}
+
+/*
+ * TODO: Handle cleanup during node offline.
+ */
+static int __init pghot_init(void)
+{
+	unsigned int hash_size;
+	size_t hash_entries;
+	size_t nr_pages = 0;
+	pg_data_t *pgdat;
+	int i, nid, ret;
+
+	/*
+	 * Arrive at the hash and heap sizes based on the
+	 * number of pages present in the lower tier nodes.
+	 */
+	for_each_node_state(nid, N_MEMORY) {
+		if (!node_is_toptier(nid))
+			nr_pages += NODE_DATA(nid)->node_present_pages;
+	}
+
+	if (!nr_pages)
+		return 0;
+
+	hash_entries = nr_pages * PGHOT_HASH_PCT / 100;
+	hash_size = hash_entries / PGHOT_HASH_ENTRIES;
+	phi_hash_order = ilog2(hash_size);
+
+	phi_hash = vmalloc(sizeof(struct pghot_hash) * hash_size);
+	if (!phi_hash) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < hash_size; i++) {
+		INIT_HLIST_HEAD(&phi_hash[i].hash);
+		spin_lock_init(&phi_hash[i].lock);
+	}
+
+	phi_cache = KMEM_CACHE(pghot_info, 0);
+	if (unlikely(!phi_cache)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	phi_heap_entries = hash_entries * PGHOT_HEAP_PCT / 100;
+	for_each_node_state(nid, N_CPU) {
+		ret = kpromoted_run(nid);
+		if (ret)
+			goto out_stop_kthread;
+	}
+
+	register_sysctl_init("vm", pghot_sysctls);
+	kpromoted_started = true;
+	pr_info("pghot: Started page hotness monitoring and promotion thread\n");
+	pr_info("pghot: nr_pages %ld hash_size %d hash_entries %ld hash_order %d heap_entries %d\n",
+		nr_pages, hash_size, hash_entries, phi_hash_order, phi_heap_entries);
+	return 0;
+
+out_stop_kthread:
+	for_each_node_state(nid, N_CPU) {
+		pgdat = NODE_DATA(nid);
+		if (pgdat->kpromoted) {
+			kthread_stop(pgdat->kpromoted);
+			pgdat->kpromoted = NULL;
+			vfree(pgdat->phi_buf);
+		}
+	}
+out:
+	kmem_cache_destroy(phi_cache);
+	vfree(phi_hash);
+	return ret;
+}
+
+late_initcall(pghot_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 71cd1ceba191..ee122c2cd137 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1494,6 +1494,15 @@ const char * const vmstat_text[] = {
 	[I(KSTACK_REST)] = "kstack_rest",
 #endif
 #endif
+	[I(PGHOT_RECORDED_ACCESSES)] = "pghot_recorded_accesses",
+	[I(PGHOT_RECORD_HWHINTS)] = "pghot_recorded_hwhints",
+	[I(PGHOT_RECORD_PGTSCANS)] = "pghot_recorded_pgtscans",
+	[I(PGHOT_RECORD_HINTFAULTS)] = "pghot_recorded_hintfaults",
+	[I(PGHOT_RECORDS_HASH)] = "pghot_records_hash",
+	[I(PGHOT_RECORDS_HEAP)] = "pghot_records_heap",
+	[I(KPROMOTED_RIGHT_NODE)] = "kpromoted_right_node",
+	[I(KPROMOTED_NON_LRU)] = "kpromoted_non_lru",
+	[I(KPROMOTED_DROPPED)] = "kpromoted_dropped",
 #undef I
 #endif /* CONFIG_VM_EVENT_COUNTERS */
 };
-- 
2.34.1

From nobody Wed Sep 10 23:31:03 2025
From: Bharata B Rao
Subject: [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling
Date: Wed, 10 Sep 2025 20:16:49 +0530
Message-ID: <20250910144653.212066-5-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Use IBS (Instruction Based Sampling) feature present in AMD
processors for memory access tracking. The access information
obtained from IBS via NMI is fed to kpromoted daemon for
further action.

In addition to other information related to the memory
access, IBS provides physical (and virtual) address of the access
and indicates if the access came from slower tier. Only memory
accesses originating from slower tiers are further acted upon
by this driver.

The samples are initially accumulated in percpu buffers which
are flushed to pghot hot page tracking mechanism using irq_work.

TODO: Many counters are added to vmstat just as debugging aid
for now.

About IBS
---------
IBS can be programmed to provide data about instruction
execution periodically. This is done by programming a desired
sample count (number of ops) in a control register.
When the programmed number of ops are dispatched, a micro-op
gets tagged, various information about the tagged micro-op's
execution is populated in IBS execution MSRs and an interrupt
is raised.

While IBS provides a lot of data for each sample, for the
purpose of memory access profiling, we are interested in
linear and physical address of the memory access that reached
DRAM. Recent AMD processors provide further filtering where
it is possible to limit the sampling to those ops that had
an L3 miss which greatly reduces the non-useful samples.

While IBS provides capability to sample instruction fetch
and execution, only IBS execution sampling is used here
to collect data about memory accesses that occur during
the instruction execution.

More information about IBS is available in Sec 13.3 of
AMD64 Architecture Programmer's Manual, Volume 2:System
Programming which is present at:
https://bugzilla.kernel.org/attachment.cgi?id=288923

Information about MSRs used for programming IBS can be
found in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h
Model 11h B1 which is currently present at:
https://www.amd.com/system/files/TechDocs/55901_0.25.zip

Signed-off-by: Bharata B Rao
---
 arch/x86/events/amd/ibs.c        |  11 ++
 arch/x86/include/asm/ibs.h       |   7 +
 arch/x86/include/asm/msr-index.h |  16 ++
 arch/x86/mm/Makefile             |   3 +-
 arch/x86/mm/ibs.c                | 311 +++++++++++++++++++++++++++++++
 include/linux/vm_event_item.h    |  17 ++
 mm/vmstat.c                      |  17 ++
 7 files changed, 381 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 112f43b23ebf..1498dc9caeb2 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -13,9 +13,11 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
+#include
 
 #include "../perf_event.h"
 
@@ -1756,6 +1758,15 @@ static __init int amd_ibs_init(void)
 {
 	u32 caps;
 
+	/*
+	 * TODO: Find a clean way to disable perf IBS so that IBS
+	 * can be used for memory access profiling.
+	 */
+	if (arch_hw_access_profiling) {
+		pr_info("IBS isn't available for perf use\n");
+		return 0;
+	}
+
 	caps = __get_ibs_caps();
 	if (!caps)
 		return -ENODEV;	/* ibs not supported by the cpu */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
new file mode 100644
index 000000000000..b5a4f2ca6330
--- /dev/null
+++ b/arch/x86/include/asm/ibs.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_H
+#define _ASM_X86_IBS_H
+
+extern bool arch_hw_access_profiling;
+
+#endif /* _ASM_X86_IBS_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b65c3ba5fa14..55d26380550c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -742,6 +742,22 @@
 /* AMD Last Branch Record MSRs */
 #define MSR_AMD64_LBR_SELECT			0xc000010e
 
+/* AMD IBS MSR bits */
+#define MSR_AMD64_IBSOPDATA2_DATASRC			0x7
+#define MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE		0x1
+#define MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR	0x2
+#define MSR_AMD64_IBSOPDATA2_DATASRC_DRAM		0x3
+#define MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE	0x5
+#define MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM		0x8
+#define MSR_AMD64_IBSOPDATA2_RMTNODE			0x10
+
+#define MSR_AMD64_IBSOPDATA3_LDOP		BIT_ULL(0)
+#define MSR_AMD64_IBSOPDATA3_STOP		BIT_ULL(1)
+#define MSR_AMD64_IBSOPDATA3_DCMISS		BIT_ULL(7)
+#define MSR_AMD64_IBSOPDATA3_LADDR_VALID	BIT_ULL(17)
+#define MSR_AMD64_IBSOPDATA3_PADDR_VALID	BIT_ULL(18)
+#define MSR_AMD64_IBSOPDATA3_L2MISS		BIT_ULL(20)
+
 /* Zen4 */
 #define MSR_ZEN4_BP_CFG			0xc001102e
 #define MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT	4
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..967e5af9eba9 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,7 +22,8 @@ CFLAGS_REMOVE_pgprot.o = -pg
 endif
 
 obj-y				:= init.o init_$(BITS).o fault.o ioremap.o extable.o mmap.o \
-				    pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o
+				    pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o \
+				    ibs.o
 
 obj-y				+= pat/
 
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
new file mode 100644
index 000000000000..6669710dd35b
--- /dev/null
+++ b/arch/x86/mm/ibs.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include /* TODO: Move defns like IBS_OP_ENABLE into non-perf header */
+#include
+#include
+
+bool arch_hw_access_profiling;
+static u64 ibs_config __read_mostly;
+static u32 ibs_caps;
+
+#define IBS_NR_SAMPLES 150
+
+/*
+ * Basic access info captured for each memory access.
+ */
+struct ibs_sample {
+	unsigned long pfn;
+	unsigned long time;	/* jiffies when accessed */
+	int nid;		/* Accessing node ID, if known */
+};
+
+/*
+ * Percpu buffer of access samples. Samples are accumulated here
+ * before pushing them to kpromoted for further action.
+ */
+struct ibs_sample_pcpu {
+	struct ibs_sample samples[IBS_NR_SAMPLES];
+	int head, tail;
+};
+
+struct ibs_sample_pcpu __percpu *ibs_s;
+
+/*
+ * The workqueue for pushing the percpu access samples to kpromoted.
+ */
+static struct work_struct ibs_work;
+static struct irq_work ibs_irq_work;
+
+/*
+ * Record the IBS-reported access sample in percpu buffer.
+ * Called from IBS NMI handler.
+ */ +static int ibs_push_sample(unsigned long pfn, int nid, unsigned long time) +{ + struct ibs_sample_pcpu *ibs_pcpu =3D raw_cpu_ptr(ibs_s); + int next =3D ibs_pcpu->head + 1; + + if (next >=3D IBS_NR_SAMPLES) + next =3D 0; + + if (next =3D=3D ibs_pcpu->tail) + return 0; + + ibs_pcpu->samples[ibs_pcpu->head].pfn =3D pfn; + ibs_pcpu->samples[ibs_pcpu->head].time =3D time; + ibs_pcpu->head =3D next; + return 1; +} + +static int ibs_pop_sample(struct ibs_sample *s) +{ + struct ibs_sample_pcpu *ibs_pcpu =3D raw_cpu_ptr(ibs_s); + + int next =3D ibs_pcpu->tail + 1; + + if (ibs_pcpu->head =3D=3D ibs_pcpu->tail) + return 0; + + if (next >=3D IBS_NR_SAMPLES) + next =3D 0; + + *s =3D ibs_pcpu->samples[ibs_pcpu->tail]; + ibs_pcpu->tail =3D next; + return 1; +} + +/* + * Remove access samples from percpu buffer and send them + * to kpromoted for further action. + */ +static void ibs_work_handler(struct work_struct *work) +{ + struct ibs_sample s; + + while (ibs_pop_sample(&s)) + pghot_record_access(s.pfn, s.nid, PGHOT_HW_HINTS, s.time); +} + +static void ibs_irq_handler(struct irq_work *i) +{ + schedule_work_on(smp_processor_id(), &ibs_work); +} + +/* + * IBS NMI handler: Process the memory access info reported by IBS. + * + * Reads the MSRs to collect all the information about the reported + * memory access, validates the access, stores the valid sample and + * schedules the work on this CPU to further process the sample. + */ +static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs) +{ + struct mm_struct *mm =3D current->mm; + u64 ops_ctl, ops_data3, ops_data2; + u64 laddr =3D -1, paddr =3D -1; + u64 data_src, rmt_node; + struct page *page; + unsigned long pfn; + + rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl); + + /* + * When IBS sampling period is reprogrammed via read-modify-update + * of MSR_AMD64_IBSOPCTL, overflow NMIs could be generated with + * IBS_OP_ENABLE not set. For such cases, return as HANDLED. 
+ * + * With this, the handler will say "handled" for all NMIs that + * aren't related to this NMI. This stems from the limitation of + * having both status and control bits in one MSR. + */ + if (!(ops_ctl & IBS_OP_VAL)) + goto handled; + + wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_VAL); + + count_vm_event(HWHINT_NR_EVENTS); + + if (!user_mode(regs)) { + count_vm_event(HWHINT_KERNEL); + goto handled; + } + + if (!mm) { + count_vm_event(HWHINT_KTHREAD); + goto handled; + } + + rdmsrl(MSR_AMD64_IBSOPDATA3, ops_data3); + + /* Load/Store ops only */ + /* TODO: DataSrc isn't valid for stores, so filter out stores? */ + if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_LDOP | + MSR_AMD64_IBSOPDATA3_STOP))) { + count_vm_event(HWHINT_NON_LOAD_STORES); + goto handled; + } + + /* Discard the sample if it was L1 or L2 hit */ + if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_DCMISS | + MSR_AMD64_IBSOPDATA3_L2MISS))) { + count_vm_event(HWHINT_DC_L2_HITS); + goto handled; + } + + rdmsrl(MSR_AMD64_IBSOPDATA2, ops_data2); + data_src =3D ops_data2 & MSR_AMD64_IBSOPDATA2_DATASRC; + if (ibs_caps & IBS_CAPS_ZEN4) + data_src |=3D ((ops_data2 & 0xC0) >> 3); + + switch (data_src) { + case MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE: + count_vm_event(HWHINT_LOCAL_L3L1L2); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR: + count_vm_event(HWHINT_LOCAL_PEER_CACHE_NEAR); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_DRAM: + count_vm_event(HWHINT_DRAM_ACCESSES); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM: + count_vm_event(HWHINT_CXL_ACCESSES); + break; + case MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE: + count_vm_event(HWHINT_FAR_CACHE_HITS); + break; + } + + rmt_node =3D ops_data2 & MSR_AMD64_IBSOPDATA2_RMTNODE; + if (rmt_node) + count_vm_event(HWHINT_REMOTE_NODE); + + /* Is linear addr valid? 
*/ + if (ops_data3 & MSR_AMD64_IBSOPDATA3_LADDR_VALID) + rdmsrl(MSR_AMD64_IBSDCLINAD, laddr); + else { + count_vm_event(HWHINT_LADDR_INVALID); + goto handled; + } + + /* Discard kernel address accesses */ + if (laddr & (1UL << 63)) { + count_vm_event(HWHINT_KERNEL_ADDR); + goto handled; + } + + /* Is phys addr valid? */ + if (ops_data3 & MSR_AMD64_IBSOPDATA3_PADDR_VALID) + rdmsrl(MSR_AMD64_IBSDCPHYSAD, paddr); + else { + count_vm_event(HWHINT_PADDR_INVALID); + goto handled; + } + + pfn =3D PHYS_PFN(paddr); + page =3D pfn_to_online_page(pfn); + if (!page) + goto handled; + + if (!PageLRU(page)) { + count_vm_event(HWHINT_NON_LRU); + goto handled; + } + + if (!ibs_push_sample(pfn, numa_node_id(), jiffies)) { + count_vm_event(HWHINT_BUFFER_FULL); + goto handled; + } + + irq_work_queue(&ibs_irq_work); + count_vm_event(HWHINT_USEFUL_SAMPLES); + +handled: + return NMI_HANDLED; +} + +static inline int get_ibs_lvt_offset(void) +{ + u64 val; + + rdmsrl(MSR_AMD64_IBSCTL, val); + if (!(val & IBSCTL_LVT_OFFSET_VALID)) + return -EINVAL; + + return val & IBSCTL_LVT_OFFSET_MASK; +} + +static void setup_APIC_ibs(void) +{ + int offset; + + offset =3D get_ibs_lvt_offset(); + if (offset < 0) + goto failed; + + if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0)) + return; +failed: + pr_warn("IBS APIC setup failed on cpu #%d\n", + smp_processor_id()); +} + +static void clear_APIC_ibs(void) +{ + int offset; + + offset =3D get_ibs_lvt_offset(); + if (offset >=3D 0) + setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1); +} + +static int x86_amd_ibs_access_profile_startup(unsigned int cpu) +{ + setup_APIC_ibs(); + return 0; +} + +static int x86_amd_ibs_access_profile_teardown(unsigned int cpu) +{ + clear_APIC_ibs(); + return 0; +} + +static int __init ibs_access_profiling_init(void) +{ + if (!boot_cpu_has(X86_FEATURE_IBS)) { + pr_info("IBS capability is unavailable for access profiling\n"); + return 0; + } + + ibs_s =3D alloc_percpu_gfp(struct ibs_sample_pcpu, GFP_KERNEL | __GFP_ZER= 
O); + if (!ibs_s) + return 0; + + INIT_WORK(&ibs_work, ibs_work_handler); + init_irq_work(&ibs_irq_work, ibs_irq_handler); + + /* Uses IBS Op sampling */ + ibs_config =3D IBS_OP_CNT_CTL | IBS_OP_ENABLE; + ibs_caps =3D cpuid_eax(IBS_CPUID_FEATURES); + if (ibs_caps & IBS_CAPS_ZEN4) + ibs_config |=3D IBS_OP_L3MISSONLY; + + register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs"); + + cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING, + "x86/amd/ibs_access_profile:starting", + x86_amd_ibs_access_profile_startup, + x86_amd_ibs_access_profile_teardown); + + pr_info("IBS setup for memory access profiling\n"); + return 0; +} + +arch_initcall(ibs_access_profiling_init); diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index a996fa9df785..bca57b05766d 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -195,6 +195,23 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, KPROMOTED_RIGHT_NODE, KPROMOTED_NON_LRU, KPROMOTED_DROPPED, + HWHINT_NR_EVENTS, + HWHINT_KERNEL, + HWHINT_KTHREAD, + HWHINT_NON_LOAD_STORES, + HWHINT_DC_L2_HITS, + HWHINT_LOCAL_L3L1L2, + HWHINT_LOCAL_PEER_CACHE_NEAR, + HWHINT_FAR_CACHE_HITS, + HWHINT_DRAM_ACCESSES, + HWHINT_CXL_ACCESSES, + HWHINT_REMOTE_NODE, + HWHINT_LADDR_INVALID, + HWHINT_KERNEL_ADDR, + HWHINT_PADDR_INVALID, + HWHINT_NON_LRU, + HWHINT_BUFFER_FULL, + HWHINT_USEFUL_SAMPLES, NR_VM_EVENT_ITEMS }; =20 diff --git a/mm/vmstat.c b/mm/vmstat.c index ee122c2cd137..aa743708c79b 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1503,6 +1503,23 @@ const char * const vmstat_text[] =3D { [I(KPROMOTED_RIGHT_NODE)] =3D "kpromoted_right_node", [I(KPROMOTED_NON_LRU)] =3D "kpromoted_non_lru", [I(KPROMOTED_DROPPED)] =3D "kpromoted_dropped", + [I(HWHINT_NR_EVENTS)] =3D "hwhint_nr_events", + [I(HWHINT_KERNEL)] =3D "hwhint_kernel", + [I(HWHINT_KTHREAD)] =3D "hwhint_kthread", + [I(HWHINT_NON_LOAD_STORES)] =3D "hwhint_non_load_stores", + [I(HWHINT_DC_L2_HITS)] =3D "hwhint_dc_l2_hits", + 
[I(HWHINT_LOCAL_L3L1L2)] =3D "hwhint_local_l3l1l2", + [I(HWHINT_LOCAL_PEER_CACHE_NEAR)] =3D "hwhint_local_peer_cache_near", + [I(HWHINT_FAR_CACHE_HITS)] =3D "hwhint_far_cache_hits", + [I(HWHINT_DRAM_ACCESSES)] =3D "hwhint_dram_accesses", + [I(HWHINT_CXL_ACCESSES)] =3D "hwhint_cxl_accesses", + [I(HWHINT_REMOTE_NODE)] =3D "hwhint_remote_node", + [I(HWHINT_LADDR_INVALID)] =3D "hwhint_invalid_laddr", + [I(HWHINT_KERNEL_ADDR)] =3D "hwhint_kernel_addr", + [I(HWHINT_PADDR_INVALID)] =3D "hwhint_invalid_paddr", + [I(HWHINT_NON_LRU)] =3D "hwhint_non_lru", + [I(HWHINT_BUFFER_FULL)] =3D "hwhint_buffer_full", + [I(HWHINT_USEFUL_SAMPLES)] =3D "hwhint_useful_samples", #undef I #endif /* CONFIG_VM_EVENT_COUNTERS */ }; --=20 2.34.1 From nobody Wed Sep 10 23:31:03 2025
From: Bharata B Rao
Subject: [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses
Date: Wed, 10 Sep 2025 20:16:50 +0530
Message-ID: <20250910144653.212066-6-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Enable IBS memory access data collection for user memory accesses by
programming the required MSRs. The profiling is turned ON only for
user mode execution and turned OFF for kernel mode execution.
Profiling is explicitly disabled for NMI handler too.

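For illustration only, the MSR value composed on each user/kernel
transition can be sketched in userspace C as below. The bit definitions
are assumed to match those in the kernel's perf_event.h, and the helper
names ibs_op_ctl_start() and ibs_op_ctl_stop() are hypothetical; they
only model the arithmetic, not the rdmsrl()/wrmsrl() accesses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of MSR_AMD64_IBSOPCTL, mirroring perf_event.h. */
#define IBS_OP_L3MISSONLY        (1ULL << 16)
#define IBS_OP_ENABLE            (1ULL << 17)
#define IBS_OP_CNT_CTL           (1ULL << 19)
#define IBS_OP_MAX_CNT           0x0000FFFFULL
#define IBS_OP_MAX_CNT_EXT_MASK  (0x7FULL << 20)

/* Hypothetical helper: value written when returning to user mode. */
static inline uint64_t ibs_op_ctl_start(uint64_t ibs_config,
					unsigned int period, bool has_mm)
{
	uint64_t config;

	if (!has_mm)
		return 0;	/* kernel thread: leave op sampling disabled */

	/* Bits 19:4 of the period are stored in bits 15:0 of the MSR. */
	config = (period >> 4) & IBS_OP_MAX_CNT;
	/* Bits 26:20 of the period are stored in place. */
	config |= period & IBS_OP_MAX_CNT_EXT_MASK;
	return config | ibs_config;
}

/* Hypothetical helper: value written on kernel entry and NMI entry. */
static inline uint64_t ibs_op_ctl_stop(uint64_t ops_ctl)
{
	return ops_ctl & ~IBS_OP_ENABLE;
}
```

Under these assumptions a period of 10000 contributes 10000 >> 4 = 625
(0x271) to the low 16 bits, so the low four bits of the period are
dropped.
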
TODOs:
- IBS sampling rate is kept fixed for now.
- Arch/vendor separation/isolation of the code needs a relook.

Signed-off-by: Bharata B Rao
---
 arch/x86/include/asm/entry-common.h |  3 +++
 arch/x86/include/asm/hardirq.h      |  2 ++
 arch/x86/include/asm/ibs.h          |  2 ++
 arch/x86/mm/ibs.c                   | 32 +++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index d535a97c7284..7144b57d209b 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -9,10 +9,12 @@
 #include
 #include
 #include
+#include
 
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
 {
+	hw_access_profiling_stop();
 	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
 		/*
 		 * Make sure that the entry code gave us a sensible EFLAGS
@@ -99,6 +101,7 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 static __always_inline void arch_exit_to_user_mode(void)
 {
 	amd_clear_divider();
+	hw_access_profiling_start();
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index f00c09ffe6a9..0752cb6ebd7a 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -91,4 +91,6 @@ static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void)
 static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
 #endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
 
+#define arch_nmi_enter()	hw_access_profiling_stop()
+#define arch_nmi_exit()		hw_access_profiling_start()
 #endif /* _ASM_X86_HARDIRQ_H */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
index b5a4f2ca6330..6b480958534e 100644
--- a/arch/x86/include/asm/ibs.h
+++ b/arch/x86/include/asm/ibs.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_IBS_H
 #define _ASM_X86_IBS_H
 
+void hw_access_profiling_start(void);
+void hw_access_profiling_stop(void);
extern bool arch_hw_access_profiling; =20 #endif /* _ASM_X86_IBS_H */ diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c index 6669710dd35b..3128e8fa5f39 100644 --- a/arch/x86/mm/ibs.c +++ b/arch/x86/mm/ibs.c @@ -16,6 +16,7 @@ static u64 ibs_config __read_mostly; static u32 ibs_caps; =20 #define IBS_NR_SAMPLES 150 +#define IBS_SAMPLE_PERIOD 10000 =20 /* * Basic access info captured for each memory access. @@ -98,6 +99,36 @@ static void ibs_irq_handler(struct irq_work *i) schedule_work_on(smp_processor_id(), &ibs_work); } =20 +void hw_access_profiling_stop(void) +{ + u64 ops_ctl; + + if (!arch_hw_access_profiling) + return; + + rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl); + wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_ENABLE); +} + +void hw_access_profiling_start(void) +{ + u64 config =3D 0; + unsigned int period =3D IBS_SAMPLE_PERIOD; + + if (!arch_hw_access_profiling) + return; + + /* Disable IBS for kernel thread */ + if (!current->mm) + goto out; + + config =3D (period >> 4) & IBS_OP_MAX_CNT; + config |=3D (period & IBS_OP_MAX_CNT_EXT_MASK); + config |=3D ibs_config; +out: + wrmsrl(MSR_AMD64_IBSOPCTL, config); +} + /* * IBS NMI handler: Process the memory access info reported by IBS. 
* @@ -304,6 +335,7 @@ static int __init ibs_access_profiling_init(void) x86_amd_ibs_access_profile_startup, x86_amd_ibs_access_profile_teardown); =20 + arch_hw_access_profiling =3D true; pr_info("IBS setup for memory access profiling\n"); return 0; } --=20 2.34.1 From nobody Wed Sep 10 23:31:03 2025
From: Bharata B Rao
Subject: [RFC PATCH v2 6/8] mm: mglru: generalize page table walk
Date: Wed, 10 Sep 2025 20:16:51 +0530
Message-ID: <20250910144653.212066-7-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Kinsey Ho

Refactor the existing MGLRU page table walking logic to make it
resumable.

Additionally, introduce two hooks into the MGLRU page table walk:
accessed callback and flush callback. The accessed callback is called
for each accessed page detected via the scanned accessed bit. The flush
callback is called when the accessed callback reports an out of space
error. This allows for processing pages in batches for efficiency.

With a generalised page table walk, introduce a new scan function which
repeatedly scans on the same young generation and does not add a new
young generation.

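The interplay of the two hooks can be sketched in userspace C as below.
This is only a model under stated assumptions: the callback signatures
follow the ones added to struct lru_gen_mm_walk and
lru_gen_scan_lruvec(), while sample_scan(), the sample_* callbacks,
BATCH_SIZE, the use of -ENOSPC as the "out of space" error and the
final flush are hypothetical and stand in for the real page table walk.

```c
#include <errno.h>
#include <stddef.h>

#define BATCH_SIZE 4	/* hypothetical batch capacity */

static unsigned long batch[BATCH_SIZE];
static size_t batch_len;
static size_t nr_flushed;	/* total PFNs handed off so far */

/* Same shape as accessed_cb: record one accessed PFN. */
static int sample_accessed_cb(unsigned long pfn)
{
	batch[batch_len++] = pfn;
	/* Report "out of space" once the batch is full, after storing pfn. */
	return batch_len == BATCH_SIZE ? -ENOSPC : 0;
}

/* Same shape as flush_cb: drain the batch. */
static void sample_flush_cb(void)
{
	nr_flushed += batch_len;
	batch_len = 0;
}

/* Stand-in for the resumable walk over a list of accessed PFNs. */
static void sample_scan(const unsigned long *pfns, size_t n,
			int (*accessed_cb)(unsigned long),
			void (*flush_cb)(void))
{
	size_t next = 0;	/* plays the role of walk->next_addr */

	while (next < n) {
		int err = accessed_cb(pfns[next++]);

		if (err)
			flush_cb();	/* make room, then resume at 'next' */
	}
	flush_cb();	/* drain the final partial batch */
}
```

Because the position is advanced before the error is acted upon, the
walk resumes at the entry following the one that filled the batch, and
no accessed page is reported twice.
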
Signed-off-by: Kinsey Ho
Signed-off-by: Yuanchu Xie
Signed-off-by: Bharata B Rao
---
 include/linux/mmzone.h |   5 ++
 mm/internal.h          |   4 +
 mm/vmscan.c            | 176 ++++++++++++++++++++++++++++++-----------
 3 files changed, 139 insertions(+), 46 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f7094babed10..4ad15490aff6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -533,6 +533,8 @@ struct lru_gen_mm_walk {
 	unsigned long seq;
 	/* the next address within an mm to scan */
 	unsigned long next_addr;
+	/* called for each accessed pte/pmd */
+	int (*accessed_cb)(unsigned long pfn);
 	/* to batch promoted pages */
 	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* to batch the mm stats */
@@ -540,6 +542,9 @@ struct lru_gen_mm_walk {
 	/* total batched items */
 	int batched;
 	int swappiness;
+	/* for the pmd under scanning */
+	int nr_young_pte;
+	int nr_total_pte;
 	bool force_scan;
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc03..6c2c86abfde2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -548,6 +548,10 @@ static inline int user_proactive_reclaim(char *buf,
 	return 0;
 }
 #endif
+void set_task_reclaim_state(struct task_struct *task,
+			    struct reclaim_state *rs);
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+			 int (*accessed_cb)(unsigned long), void (*flush_cb)(void));
 
 /*
  * in mm/rmap.c:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7de11524a936..4146e17f90ae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -289,7 +289,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
 			continue;				\
 		else
 
-static void set_task_reclaim_state(struct task_struct *task,
+void set_task_reclaim_state(struct task_struct *task,
 				   struct reclaim_state *rs)
 {
 	/* Check for an overwrite */
@@ -3092,7 +3092,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
 
 	VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->seq);
 
-	if (walk->seq <= mm_state->seq)
+	if (!walk->accessed_cb && walk->seq <= mm_state->seq)
 		goto done;
 
 	if (!mm_state->head)
@@ -3518,16 +3518,14 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
 	}
 }
 
-static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
-			   struct mm_walk *args)
+static int walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			   struct mm_walk *args, bool *suitable)
 {
-	int i;
+	int i, err = 0;
 	bool dirty;
 	pte_t *pte;
 	spinlock_t *ptl;
 	unsigned long addr;
-	int total = 0;
-	int young = 0;
 	struct folio *last = NULL;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3537,17 +3535,21 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	pmd_t pmdval;
 
 	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
-	if (!pte)
-		return false;
+	if (!pte) {
+		*suitable = false;
+		return 0;
+	}
 
 	if (!spin_trylock(ptl)) {
 		pte_unmap(pte);
-		return true;
+		*suitable = true;
+		return 0;
 	}
 
 	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
 		pte_unmap_unlock(pte, ptl);
-		return false;
+		*suitable = false;
+		return 0;
 	}
 
 	arch_enter_lazy_mmu_mode();
@@ -3557,7 +3559,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		struct folio *folio;
 		pte_t ptent = ptep_get(pte + i);
 
-		total++;
+		walk->nr_total_pte++;
 		walk->mm_stats[MM_LEAF_TOTAL]++;
 
 		pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
@@ -3581,23 +3583,34 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		if (pte_dirty(ptent))
 			dirty = true;
 
-		young++;
+		walk->nr_young_pte++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
+
+		if (!walk->accessed_cb)
+			continue;
+
+		err = walk->accessed_cb(pfn);
+		if (err) {
+			walk->next_addr = addr + PAGE_SIZE;
+			break;
+		}
 	}
 
 	walk_update_folio(walk, last, gen, dirty);
 	last = NULL;
 
-	if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+	if (!err && i < PTRS_PER_PTE &&
+	    get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
 		goto restart;
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte, ptl);
 
-	return suitable_to_scan(total, young);
+	*suitable = suitable_to_scan(walk->nr_total_pte, walk->nr_young_pte);
+	return err;
 }
 
-static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
+static int walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
 				  struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
 {
 	int i;
@@ -3610,6 +3623,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int gen = lru_gen_from_seq(max_seq);
+	int err = 0;
 
 	VM_WARN_ON_ONCE(pud_leaf(*pud));
 
@@ -3617,13 +3631,13 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	if (*first == -1) {
 		*first = addr;
 		bitmap_zero(bitmap, MIN_LRU_BATCH);
-		return;
+		return 0;
 	}
 
 	i = addr == -1 ? 0 : pmd_index(addr) - pmd_index(*first);
 	if (i && i <= MIN_LRU_BATCH) {
 		__set_bit(i - 1, bitmap);
-		return;
+		return 0;
 	}
 
 	pmd = pmd_offset(pud, *first);
@@ -3673,6 +3687,16 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 			dirty = true;
 
 		walk->mm_stats[MM_LEAF_YOUNG]++;
+		if (!walk->accessed_cb)
+			goto next;
+
+		err = walk->accessed_cb(pfn);
+		if (err) {
+			i = find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
+
+			walk->next_addr = (*first & PMD_MASK) + i * PMD_SIZE;
+			break;
+		}
 next:
 		i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
 	} while (i <= MIN_LRU_BATCH);
@@ -3683,9 +3707,10 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	spin_unlock(ptl);
 done:
 	*first = -1;
+	return err;
 }
 
-static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+static int walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			   struct mm_walk *args)
 {
 	int i;
@@ -3697,6 +3722,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 	unsigned long first = -1;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
+	int err = 0;
 
 	VM_WARN_ON_ONCE(pud_leaf(*pud));
 
@@ -3710,6 +3736,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 	/* walk_pte_range() may call get_next_vma() */
 	vma = args->vma;
 	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		bool suitable;
 		pmd_t val = pmdp_get_lockless(pmd + i);
 
 		next = pmd_addr_end(addr, end);
@@ -3726,7 +3753,10 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			walk->mm_stats[MM_LEAF_TOTAL]++;
 
 			if (pfn != -1)
-				walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+				err = walk_pmd_range_locked(pud, addr, vma, args,
+						bitmap, &first);
+			if (err)
+				return err;
 			continue;
 		}
 
@@ -3735,33 +3765,50 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			if (!pmd_young(val))
 				continue;
 
-			walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+			err = walk_pmd_range_locked(pud, addr, vma, args,
+						bitmap, &first);
+			if (err)
+				return err;
 		}
 
 		if (!walk->force_scan && !test_bloom_filter(mm_state, walk->seq, pmd + i))
 			continue;
 
+		err = walk_pte_range(&val, addr, next, args, &suitable);
+		if (err && walk->next_addr < next && first == -1)
+			return err;
+
+		walk->nr_total_pte = 0;
+		walk->nr_young_pte = 0;
+
 		walk->mm_stats[MM_NONLEAF_FOUND]++;
 
-		if (!walk_pte_range(&val, addr, next, args))
-			continue;
+		if (!suitable)
+			goto next;
 
 		walk->mm_stats[MM_NONLEAF_ADDED]++;
 
 		/* carry over to the next generation */
 		update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
+next:
+		if (err) {
+			walk->next_addr = first;
+			return err;
+		}
 	}
 
-	walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
+	err = walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
 
-	if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+	if (!err && i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
 		goto restart;
+
+	return err;
 }
 
 static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 			  struct mm_walk *args)
 {
-	int i;
+	int i, err;
 	pud_t *pud;
 	unsigned long addr;
 	unsigned long next;
@@ -3779,7 +3826,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
 			continue;
 
-		walk_pmd_range(&val, addr, next, args);
+		err = walk_pmd_range(&val, addr, next, args);
+		if (err)
+			return err;
 
 		if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
 			end = (addr | ~PUD_MASK) + 1;
@@ -3800,40 +3849,48 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 	return -EAGAIN;
 }
 
-static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+static int try_walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
 {
+	int err;
 	static const struct mm_walk_ops mm_walk_ops = {
 		.test_walk = should_skip_vma,
 		.p4d_entry = walk_pud_range,
 		.walk_lock = PGWALK_RDLOCK,
 	};
-	int err;
 	struct lruvec *lruvec = walk->lruvec;
 
-	walk->next_addr = FIRST_USER_ADDRESS;
+	DEFINE_MAX_SEQ(lruvec);
 
-	do {
-		DEFINE_MAX_SEQ(lruvec);
+	err = -EBUSY;
 
-		err = -EBUSY;
+	/* another thread might have called inc_max_seq() */
+	if (walk->seq != max_seq)
+		return err;
 
-		/* another thread might have called inc_max_seq() */
-		if (walk->seq != max_seq)
-			break;
+	/* the caller might be holding the lock for write */
+	if (mmap_read_trylock(mm)) {
+		err = walk_page_range(mm, walk->next_addr, ULONG_MAX,
+				&mm_walk_ops, walk);
 
-		/* the caller might be holding the lock for write */
-		if (mmap_read_trylock(mm)) {
-			err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+		mmap_read_unlock(mm);
+	}
 
-			mmap_read_unlock(mm);
-		}
+	if (walk->batched) {
+		spin_lock_irq(&lruvec->lru_lock);
+		reset_batch_size(walk);
+		spin_unlock_irq(&lruvec->lru_lock);
+	}
 
-		if (walk->batched) {
-			spin_lock_irq(&lruvec->lru_lock);
-			reset_batch_size(walk);
-			spin_unlock_irq(&lruvec->lru_lock);
-		}
+	return err;
+}
 
+static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	int err;
+
+	walk->next_addr = FIRST_USER_ADDRESS;
+	do {
+		err = try_walk_mm(mm, walk);
 		cond_resched();
 	} while (err == -EAGAIN);
 }
@@ -4045,6 +4102,33 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
 	return success;
 }
 
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+			 int (*accessed_cb)(unsigned long), void (*flush_cb)(void))
+{
+	struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+	struct mm_struct *mm = NULL;
+
+	walk->lruvec = lruvec;
+	walk->seq = seq;
+	walk->accessed_cb = accessed_cb;
+	walk->swappiness = MAX_SWAPPINESS;
+
+	do {
+		int err = -EBUSY;
+
+		iterate_mm_list(walk, &mm);
+		if (!mm)
+			break;
+
+		walk->next_addr = FIRST_USER_ADDRESS;
+		do {
+			err = try_walk_mm(mm, walk);
+			cond_resched();
+			flush_cb();
+		} while (err == -EAGAIN);
+	} while (mm);
+}
+
 static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 			       int swappiness, bool force_scan)
 {
-- 
2.34.1

From nobody Wed Sep 10 23:31:03 2025
Received: from NAM02-BN1-obe.outbound.protection.outlook.com
 (mail-bn1nam02on2048.outbound.protection.outlook.com [40.107.212.48])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.subspace.kernel.org (Postfix) with ESMTPS id E33C8327A01
 for ; Wed, 10 Sep 2025 14:51:14 +0000 (UTC)
From: Bharata B Rao
Subject: [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion
Date: Wed, 10 Sep 2025 20:16:52 +0530
Message-ID: <20250910144653.212066-8-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Kinsey Ho

Introduce a new kernel daemon, klruscand, that periodically invokes the
MGLRU page table walk. It leverages the new callbacks to gather access
information and forwards it to the pghot hot page tracking sub-system
for promotion decisions.

This benefits from reusing the existing MGLRU page table walk
infrastructure, which is optimized with features such as hierarchical
scanning and bloom filters to reduce CPU overhead.

As an additional optimization to be added in the future, we can tune
the scan intervals for each memcg.
Signed-off-by: Kinsey Ho
Signed-off-by: Yuanchu Xie
Signed-off-by: Bharata B Rao
[Reduced the scan interval to 100ms, pfn_t to unsigned long]
---
 mm/Kconfig     |   8 ++++
 mm/Makefile    |   1 +
 mm/klruscand.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+)
 create mode 100644 mm/klruscand.c

diff --git a/mm/Kconfig b/mm/Kconfig
index 8b236eb874cf..6d53c1208729 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1393,6 +1393,14 @@ config PGHOT
 	  by various sources. Asynchronous promotion is done by per-node
 	  kernel threads.
 
+config KLRUSCAND
+	bool "Kernel lower tier access scan daemon"
+	default y
+	depends on PGHOT && LRU_GEN_WALKS_MMU
+	help
+	  Scan for accesses from lower tiers by invoking MGLRU to perform
+	  page table walks.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ecdd5241bea8..05a96ec35aa3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -148,3 +148,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
 obj-$(CONFIG_PGHOT) += pghot.o
+obj-$(CONFIG_KLRUSCAND) += klruscand.o
diff --git a/mm/klruscand.c b/mm/klruscand.c
new file mode 100644
index 000000000000..1a51aab29bd9
--- /dev/null
+++ b/mm/klruscand.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+#define KLRUSCAND_INTERVAL_MS 100
+#define BATCH_SIZE (2 << 16)
+
+static struct task_struct *scan_thread;
+static unsigned long pfn_batch[BATCH_SIZE];
+static int batch_index;
+
+static void flush_cb(void)
+{
+	int i = 0;
+
+	for (; i < batch_index; i++) {
+		u64 pfn = pfn_batch[i];
+
+		pghot_record_access((unsigned long)pfn, NUMA_NO_NODE,
+					PGHOT_PGTABLE_SCAN, jiffies);
+
+		if (i % 16 == 0)
+			cond_resched();
+	}
+	batch_index = 0;
+}
+
+static int accessed_cb(unsigned long pfn)
+{
+	if (batch_index >= BATCH_SIZE)
+		return -EAGAIN;
+
+	pfn_batch[batch_index++] = pfn;
+	return 0;
+}
+
+static int klruscand_run(void *unused)
+{
+	struct lru_gen_mm_walk *walk;
+
+	walk = kzalloc(sizeof(*walk),
+		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	if (!walk)
+		return -ENOMEM;
+
+	while (!kthread_should_stop()) {
+		unsigned long next_wake_time;
+		long sleep_time;
+		struct mem_cgroup *memcg;
+		int flags;
+		int nid;
+
+		next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL_MS);
+
+		for_each_node_state(nid, N_MEMORY) {
+			pg_data_t *pgdat = NODE_DATA(nid);
+			struct reclaim_state rs = { 0 };
+
+			if (node_is_toptier(nid))
+				continue;
+
+			rs.mm_walk = walk;
+			set_task_reclaim_state(current, &rs);
+			flags = memalloc_noreclaim_save();
+
+			memcg = mem_cgroup_iter(NULL, NULL, NULL);
+			do {
+				struct lruvec *lruvec =
+					mem_cgroup_lruvec(memcg, pgdat);
+				unsigned long max_seq =
+					READ_ONCE((lruvec)->lrugen.max_seq);
+
+				lru_gen_scan_lruvec(lruvec, max_seq,
+						    accessed_cb, flush_cb);
+				cond_resched();
+			} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+			memalloc_noreclaim_restore(flags);
+			set_task_reclaim_state(current, NULL);
+			memset(walk, 0, sizeof(*walk));
+		}
+
+		sleep_time = next_wake_time - jiffies;
+		if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
+			schedule_timeout_idle(sleep_time);
+	}
+	kfree(walk);
+	return 0;
+}
+
+static int __init klruscand_init(void)
+{
+	struct task_struct *task;
+
+	task = kthread_run(klruscand_run, NULL, "klruscand");
+
+	if (IS_ERR(task)) {
+		pr_err("Failed to create klruscand kthread\n");
+		return PTR_ERR(task);
+	}
+
+	scan_thread = task;
+	return 0;
+}
+module_init(klruscand_init);
-- 
2.34.1

From nobody Wed Sep 10 23:31:03 2025
Received: from NAM02-SN1-obe.outbound.protection.outlook.com
 (mail-sn1nam02on2049.outbound.protection.outlook.com [40.107.96.49])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested) by
 smtp.subspace.kernel.org (Postfix) with ESMTPS id 45FC7298994
 for ; Wed, 10 Sep 2025 14:51:48 +0000 (UTC)
From: Bharata B Rao
Subject: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
Date: Wed, 10 Sep 2025 20:16:53 +0530
Message-ID: <20250910144653.212066-9-bharata@amd.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20250910144653.212066-1-bharata@amd.com>
References: <20250910144653.212066-1-bharata@amd.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Currently, hot page promotion (the NUMA_BALANCING_MEMORY_TIERING mode
of NUMA Balancing) performs hot page detection (via hint faults), hot
page classification, and eventual promotion all by itself, and it sits
within the scheduler.

With the new hot page tracking and promotion mechanism available, NUMA
Balancing can limit itself to detecting hot pages (via hint faults) and
offload the rest of the functionality to the common hot page tracking
system.

The pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the hot
page info. In addition, the migration rate limiting and dynamic
threshold logic are moved to kpromoted so that they can also be used
for hot pages reported by other sources.
Signed-off-by: Bharata B Rao
---
 include/linux/pghot.h |   2 +
 kernel/sched/fair.c   | 149 ++----------------------------------------
 mm/memory.c           |  32 ++-------
 mm/pghot.c            | 132 +++++++++++++++++++++++++++++++++++--
 4 files changed, 142 insertions(+), 173 deletions(-)

diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 1443643aab13..98a72e01bdd6 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -47,6 +47,8 @@ enum pghot_src {
 #define PGHOT_HEAP_PCT 25
 
 #define KPROMOTED_MIGRATE_BATCH 1024
+#define KPROMOTED_MIGRATION_ADJUST_STEPS 16
+#define KPROMOTED_PROMOTION_THRESHOLD_WINDOW 60000
 
 /*
  * If target NID isn't available, kpromoted promotes to node 0
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..54eeddb6ec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu)
 static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
-/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
-#endif
-
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table sched_fair_sysctls[] = {
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] = {
 		.extra1 = SYSCTL_ONE,
 	},
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	{
-		.procname = "numa_balancing_promote_rate_limit_MBps",
-		.data = &sysctl_numa_balancing_promote_rate_limit,
-		.maxlen = sizeof(unsigned int),
-		.mode = 0644,
-		.proc_handler = proc_dointvec_minmax,
-		.extra1 = SYSCTL_ZERO,
-	},
-#endif /* CONFIG_NUMA_BALANCING */
 };
 
 static int __init sched_fair_sysctl_init(void)
@@ -1800,108 +1785,6 @@ static inline bool cpupid_valid(int cpupid)
 	return cpupid_to_cpu(cpupid) < nr_cpu_ids;
 }
 
-/*
- * For memory tiering mode, if there are enough free pages (more than
- * enough watermark defined here) in fast memory node, to take full
- * advantage of fast memory capacity, all recently accessed slow
- * memory pages will be migrated to fast memory node without
- * considering hot threshold.
- */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
-{
-	int z;
-	unsigned long enough_wmark;
-
-	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
-			   pgdat->node_present_pages >> 4);
-	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
-		struct zone *zone = pgdat->node_zones + z;
-
-		if (!populated_zone(zone))
-			continue;
-
-		if (zone_watermark_ok(zone, 0,
-				      promo_wmark_pages(zone) + enough_wmark,
-				      ZONE_MOVABLE, 0))
-			return true;
-	}
-	return false;
-}
-
-/*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page. So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
- *
- *	hint page fault latency = hint page fault time - scan time
- *
- * The smaller the hint page fault latency, the higher the possibility
- * for the page to be hot.
- */
-static int numa_hint_fault_latency(struct folio *folio)
-{
-	int last_time, time;
-
-	time = jiffies_to_msecs(jiffies);
-	last_time = folio_xchg_access_time(folio, time);
-
-	return (time - last_time) & PAGE_ACCESS_TIME_MASK;
-}
-
-/*
- * For memory tiering mode, too high promotion/demotion throughput may
- * hurt application latency. So we provide a mechanism to rate limit
- * the number of pages that are tried to be promoted.
- */
-static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
-				      unsigned long rate_limit, int nr)
-{
-	unsigned long nr_cand;
-	unsigned int now, start;
-
-	now = jiffies_to_msecs(jiffies);
-	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
-	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-	start = pgdat->nbp_rl_start;
-	if (now - start > MSEC_PER_SEC &&
-	    cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
-		pgdat->nbp_rl_nr_cand = nr_cand;
-	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
-		return true;
-	return false;
-}
-
-#define NUMA_MIGRATION_ADJUST_STEPS 16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
-					    unsigned long rate_limit,
-					    unsigned int ref_th)
-{
-	unsigned int now, start, th_period, unit_th, th;
-	unsigned long nr_cand, ref_cand, diff_cand;
-
-	now = jiffies_to_msecs(jiffies);
-	th_period = sysctl_numa_balancing_scan_period_max;
-	start = pgdat->nbp_th_start;
-	if (now - start > th_period &&
-	    cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
-		ref_cand = rate_limit *
-			sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
-		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
-		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
-		unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
-		th = pgdat->nbp_threshold ? : ref_th;
-		if (diff_cand > ref_cand * 11 / 10)
-			th = max(th - unit_th, unit_th);
-		else if (diff_cand < ref_cand * 9 / 10)
-			th = min(th + unit_th, ref_th * 2);
-		pgdat->nbp_th_nr_cand = nr_cand;
-		pgdat->nbp_threshold = th;
-	}
-}
-
 bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 				int src_nid, int dst_cpu)
 {
@@ -1917,33 +1800,11 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 
 	/*
 	 * The pages in slow memory node should be migrated according
-	 * to hot/cold instead of private/shared.
-	 */
-	if (folio_use_access_time(folio)) {
-		struct pglist_data *pgdat;
-		unsigned long rate_limit;
-		unsigned int latency, th, def_th;
-
-		pgdat = NODE_DATA(dst_nid);
-		if (pgdat_free_space_enough(pgdat)) {
-			/* workload changed, reset hot threshold */
-			pgdat->nbp_threshold = 0;
-			return true;
-		}
-
-		def_th = sysctl_numa_balancing_hot_threshold;
-		rate_limit = sysctl_numa_balancing_promote_rate_limit << \
-			(20 - PAGE_SHIFT);
-		numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
-		th = pgdat->nbp_threshold ? : def_th;
-		latency = numa_hint_fault_latency(folio);
-		if (latency >= th)
-			return false;
-
-		return !numa_promotion_rate_limit(pgdat, rate_limit,
-						  folio_nr_pages(folio));
-	}
+	 * to hot/cold instead of private/shared. Also the migration
+	 * of such pages are handled by kpromoted.
+	 */
+	if (folio_use_access_time(folio))
+		return true;
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..eeb34e8d9b8e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -5864,34 +5865,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
 					writable, &last_cpupid);
+	nid = target_nid;
 	if (target_nid == NUMA_NO_NODE)
 		goto out_map;
-	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
-		flags |= TNF_MIGRATE_FAIL;
-		goto out_map;
-	}
-	/* The folio is isolated and isolation code holds a folio reference. */
-	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
 	writable = false;
 	ignore_writable = true;
-
-	/* Migrate to the requested node */
-	if (!migrate_misplaced_folio(folio, target_nid)) {
-		nid = target_nid;
-		flags |= TNF_MIGRATED;
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
-		return 0;
-	}
-
-	flags |= TNF_MIGRATE_FAIL;
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				       vmf->address, &vmf->ptl);
-	if (unlikely(!vmf->pte))
-		return 0;
-	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
@@ -5905,8 +5884,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 					    writable);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
-	if (nid != NUMA_NO_NODE)
+	if (nid != NUMA_NO_NODE) {
+		pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT,
+				    jiffies);
 		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+	}
 	return 0;
 }
 
diff --git a/mm/pghot.c b/mm/pghot.c
index 9f7581818b8f..9f5746892bce 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -9,6 +9,9 @@
  *
  * kpromoted is a kernel thread that runs on each toptier node and
  * promotes pages from max_heap.
+ *
+ * Migration rate-limiting and dynamic threshold logic implementations
+ * were moved from NUMA Balancing mode 2.
  */
 #include
 #include
@@ -34,6 +37,9 @@ static bool kpromoted_started __ro_after_init;
 
 static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
 
+/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
+static unsigned int sysctl_pghot_promote_rate_limit = 65536;
+
 #ifdef CONFIG_SYSCTL
 static const struct ctl_table pghot_sysctls[] = {
 	{
@@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
 		.proc_handler = proc_dointvec_minmax,
 		.extra1 = SYSCTL_ZERO,
 	},
+	{
+		.procname = "pghot_promote_rate_limit_MBps",
+		.data = &sysctl_pghot_promote_rate_limit,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+	},
 };
 #endif
+
 static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
 {
 	return (*(struct pghot_info **)lhs)->frequency >
@@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
 	return true;
 }
 
+/*
+ * For memory tiering mode, if there are enough free pages (more than
+ * enough watermark defined here) in fast memory node, to take full
+ * advantage of fast memory capacity, all recently accessed slow
+ * memory pages will be migrated to fast memory node without
+ * considering hot threshold.
+ */
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+	int z;
+	unsigned long enough_wmark;
+
+	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+			   pgdat->node_present_pages >> 4);
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_watermark_ok(zone, 0,
+				      promo_wmark_pages(zone) + enough_wmark,
+				      ZONE_MOVABLE, 0))
+			return true;
+	}
+	return false;
+}
+
+/*
+ * For memory tiering mode, too high promotion/demotion throughput may
+ * hurt application latency. So we provide a mechanism to rate limit
+ * the number of pages that are tried to be promoted.
+ */
+static bool kpromoted_promotion_rate_limit(struct pglist_data *pgdat,
+					   unsigned long rate_limit, int nr,
+					   unsigned long time)
+{
+	unsigned long nr_cand;
+	unsigned int now, start;
+
+	now = jiffies_to_msecs(time);
+	mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+	nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+	start = pgdat->nbp_rl_start;
+	if (now - start > MSEC_PER_SEC &&
+	    cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
+		pgdat->nbp_rl_nr_cand = nr_cand;
+	if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
+		return true;
+	return false;
+}
+
+static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
+						 unsigned long rate_limit,
+						 unsigned int ref_th,
+						 unsigned long now)
+{
+	unsigned int start, th_period, unit_th, th;
+	unsigned long nr_cand, ref_cand, diff_cand;
+
+	now = jiffies_to_msecs(now);
+	th_period = KPROMOTED_PROMOTION_THRESHOLD_WINDOW;
+	start = pgdat->nbp_th_start;
+	if (now - start > th_period &&
+	    cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
+		ref_cand = rate_limit *
+			KPROMOTED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
+		nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+		diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
+		unit_th = ref_th * 2 / KPROMOTED_MIGRATION_ADJUST_STEPS;
+		th = pgdat->nbp_threshold ? : ref_th;
+		if (diff_cand > ref_cand * 11 / 10)
+			th = max(th - unit_th, unit_th);
+		else if (diff_cand < ref_cand * 9 / 10)
+			th = min(th + unit_th, ref_th * 2);
+		pgdat->nbp_th_nr_cand = nr_cand;
+		pgdat->nbp_threshold = th;
+	}
+}
+
+static inline unsigned int pghot_access_latency(struct pghot_info *phi, u32 now)
+{
+	return (now - phi->last_update);
+}
+
 static bool phi_is_pfn_hot(struct pghot_info *phi)
 {
 	struct page *page = pfn_to_online_page(phi->pfn);
-	unsigned long now = jiffies;
 	struct folio *folio;
+	struct pglist_data *pgdat;
+	unsigned long rate_limit;
+	unsigned int latency, th, def_th;
+	unsigned long now = jiffies;
 
 	if (!page || is_zone_device_page(page))
 		return false;
@@ -113,7 +216,24 @@ static bool phi_is_pfn_hot(struct pghot_info *phi)
 		return false;
 	}
 
-	return true;
+	pgdat = NODE_DATA(phi->nid);
+	if (pgdat_free_space_enough(pgdat)) {
+		/* workload changed, reset hot threshold */
+		pgdat->nbp_threshold = 0;
+		return true;
+	}
+
+	def_th = sysctl_pghot_freq_window;
+	rate_limit = sysctl_pghot_promote_rate_limit << (20 - PAGE_SHIFT);
+	kpromoted_promotion_adjust_threshold(pgdat, rate_limit, def_th, now);
+
+	th = pgdat->nbp_threshold ? : def_th;
+	latency = pghot_access_latency(phi, now & PGHOT_TIME_MASK);
+	if (latency >= th)
+		return false;
+
+	return !kpromoted_promotion_rate_limit(pgdat, rate_limit,
+					      folio_nr_pages(folio), now);
 }
 
 static struct folio *kpromoted_isolate_folio(struct pghot_info *phi)
@@ -351,9 +471,13 @@ int pghot_record_access(u64 pfn, int nid, int src, unsigned long now)
 	/*
 	 * If the previous access was beyond the threshold window
 	 * start frequency tracking afresh.
+	 *
+	 * Bypass the new window logic for NUMA hint fault source
+	 * as it is too slow in reporting accesses.
+	 * TODO: Fix this.
 	 */
-	if (((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window)) ||
-	    (nid != NUMA_NO_NODE && phi->nid != nid))
+	if ((((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window))
+	    && (src != PGHOT_HINT_FAULT)) || (nid != NUMA_NO_NODE && phi->nid != nid))
 		new_window = true;
 
 	if (new_entry || new_window) {
-- 
2.34.1