From nobody Sun Feb 8 07:07:33 2026
From: Libo Chen
To: akpm@linux-foundation.org, rostedt@goodmis.org, peterz@infradead.org,
	mgorman@suse.de, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, tj@kernel.org, llong@redhat.com
Cc: sraithal@amd.com, venkat88@linux.ibm.com, kprateek.nayak@amd.com,
	raghavendra.kt@amd.com, yu.c.chen@intel.com, tim.c.chen@intel.com,
	vineethr@linux.ibm.com, chris.hyser@oracle.com,
	daniel.m.jordan@oracle.com, lorenzo.stoakes@oracle.com,
	mkoutny@suse.com, linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v4 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Date: Wed, 23 Apr 2025 17:01:45 -0700
Message-ID: <20250424000146.1197285-2-libo.chen@oracle.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20250424000146.1197285-1-libo.chen@oracle.com>
References: <20250424000146.1197285-1-libo.chen@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page faults
as they will just be overhead. With this change, there will be no more
unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical Java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-core-per-socket Intel
platform, we have seen a 20% improvement in a microbenchmark that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Signed-off-by: Libo Chen
Tested-by: Chen Yu
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..c9903b1b3948 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	/*
+	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
+	 * no page can be migrated.
+	 */
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+		return;
+
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-- 
2.43.5

From nobody Sun Feb 8 07:07:33 2026
From: Libo Chen
To: akpm@linux-foundation.org, rostedt@goodmis.org, peterz@infradead.org,
	mgorman@suse.de, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, tj@kernel.org, llong@redhat.com
Cc: sraithal@amd.com, venkat88@linux.ibm.com, kprateek.nayak@amd.com,
	raghavendra.kt@amd.com, yu.c.chen@intel.com, tim.c.chen@intel.com,
	vineethr@linux.ibm.com, chris.hyser@oracle.com,
	daniel.m.jordan@oracle.com, lorenzo.stoakes@oracle.com,
	mkoutny@suse.com, linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v4 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Date: Wed, 23 Apr 2025 17:01:46 -0700
Message-ID: <20250424000146.1197285-3-libo.chen@oracle.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20250424000146.1197285-1-libo.chen@oracle.com>
References: <20250424000146.1197285-1-libo.chen@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Unlike the sched_skip_vma_numa tracepoint, which tracks skipped VMAs, this
one tracks the task subject to cpuset.mems pinning and prints its allowed
memory node mask.

Signed-off-by: Libo Chen
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  4 +++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..91f9dc177dad 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -745,6 +745,37 @@ TRACE_EVENT(sched_skip_vma_numa,
 		  __entry->vm_end,
 		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
 );
+
+TRACE_EVENT(sched_skip_cpuset_numa,
+
+	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
+
+	TP_ARGS(tsk, mem_allowed_ptr),
+
+	TP_STRUCT__entry(
+		__array( char, comm, TASK_COMM_LEN )
+		__field( pid_t, pid )
+		__field( pid_t, tgid )
+		__field( pid_t, ngid )
+		__array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES))
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid = task_pid_nr(tsk);
+		__entry->tgid = task_tgid_nr(tsk);
+		__entry->ngid = task_numa_group_id(tsk);
+		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
+		       sizeof(__entry->mem_allowed));
+	),
+
+	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
+		  __entry->comm,
+		  __entry->pid,
+		  __entry->tgid,
+		  __entry->ngid,
+		  MAX_NUMNODES, __entry->mem_allowed)
+);
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9903b1b3948..cc892961ce15 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work)
 	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 	 * no page can be migrated.
 	 */
-	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
+		trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
 		return;
+	}
 
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
-- 
2.43.5
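
For readers who want to experiment with the logic outside the kernel, the
skip condition from patch 1/2 and the range-list rendering that the %*pbl
specifier produces in patch 2/2 can be approximated in userspace. This is
only a rough sketch, not kernel code: node_mask_t, node_mask_weight(),
should_skip_numa_scan() and format_node_list() are hypothetical names made
up for illustration, and the mask is assumed to fit in 64 bits, whereas the
real nodemask_t is sized by MAX_NUMNODES.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for nodemask_t: one bit per NUMA node, 64 nodes max. */
typedef unsigned long long node_mask_t;

/* Count the set bits, similar in spirit to the kernel's nodes_weight(). */
static int node_mask_weight(node_mask_t mask)
{
	int weight = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		weight++;
	}
	return weight;
}

/*
 * Mirror of the early-return condition: with cpusets active and exactly
 * one allowed memory node, no page could be migrated, so scanning is
 * skipped.
 */
static bool should_skip_numa_scan(bool cpusets_on, node_mask_t mems_allowed)
{
	return cpusets_on && node_mask_weight(mems_allowed) == 1;
}

/*
 * Render the mask as a comma-separated range list such as "0-1,3", the
 * style of output that %*pbl gives for a bitmap. Returns the number of
 * characters written, not counting the terminating NUL.
 */
static size_t format_node_list(node_mask_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int node = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	while (node < 64) {
		int start, end, n;

		if (!(mask & (1ULL << node))) {
			node++;
			continue;
		}

		/* Extend the run of consecutive set bits. */
		start = node;
		while (node < 64 && (mask & (1ULL << node)))
			node++;
		end = node - 1;

		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, end);

		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial range */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

With these helpers, should_skip_numa_scan(true, 0x4) is true because only
node 2 is allowed, while a mask of 0x3 (nodes 0 and 1) does not trigger the
skip. format_node_list(0xB, buf, sizeof(buf)) fills buf with "0-1,3".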