From nobody Sun Feb 8 06:00:03 2026
From: Libo Chen
To: akpm@linux-foundation.org, rostedt@goodmis.org, peterz@infradead.org,
	mgorman@suse.de, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, tj@kernel.org, llong@redhat.com
Cc: sraithal@amd.com, venkat88@linux.ibm.com, kprateek.nayak@amd.com,
	raghavendra.kt@amd.com, yu.c.chen@intel.com, tim.c.chen@intel.com,
	vineethr@linux.ibm.com, chris.hyser@oracle.com,
	daniel.m.jordan@oracle.com, lorenzo.stoakes@oracle.com,
	mkoutny@suse.com, linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v5 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Date: Wed, 23 Apr 2025 19:45:22 -0700
Message-ID: <20250424024523.2298272-2-libo.chen@oracle.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20250424024523.2298272-1-libo.chen@oracle.com>
References: <20250424024523.2298272-1-libo.chen@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page faults
as they will just be overhead. With this change, there will be no more
unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical Java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
platform, we have seen a 20% improvement in a microbenchmark that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Signed-off-by: Libo Chen
Tested-by: Chen Yu
Tested-by: K Prateek Nayak
Tested-by: Srikanth Aithal
Tested-by: Venkat Rao Bagalkote
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..c9903b1b3948 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	/*
+	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
+	 * no page can be migrated.
+	 */
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+		return;
+
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-- 
2.43.5

From nobody Sun Feb 8 06:00:03 2026
From: Libo Chen
To: akpm@linux-foundation.org, rostedt@goodmis.org, peterz@infradead.org,
	mgorman@suse.de, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, tj@kernel.org, llong@redhat.com
Cc: sraithal@amd.com, venkat88@linux.ibm.com, kprateek.nayak@amd.com,
	raghavendra.kt@amd.com, yu.c.chen@intel.com, tim.c.chen@intel.com,
	vineethr@linux.ibm.com, chris.hyser@oracle.com,
	daniel.m.jordan@oracle.com, lorenzo.stoakes@oracle.com,
	mkoutny@suse.com, linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v5 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Date: Wed, 23 Apr 2025 19:45:23 -0700
Message-ID: <20250424024523.2298272-3-libo.chen@oracle.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20250424024523.2298272-1-libo.chen@oracle.com>
References: <20250424024523.2298272-1-libo.chen@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Unlike the sched_skip_vma_numa tracepoint, which tracks skipped VMAs, this
tracepoint tracks the task subject to cpuset.mems pinning and prints out
its allowed memory node mask.

Signed-off-by: Libo Chen
Tested-by: Chen Yu
Tested-by: K Prateek Nayak
Tested-by: Srikanth Aithal
Tested-by: Venkat Rao Bagalkote
---
 include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  4 +++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..ff3990318aec 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -745,6 +745,39 @@ TRACE_EVENT(sched_skip_vma_numa,
 		  __entry->vm_end,
 		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
 );
+
+TRACE_EVENT(sched_skip_cpuset_numa,
+
+	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
+
+	TP_ARGS(tsk, mem_allowed_ptr),
+
+	TP_STRUCT__entry(
+		__array( char,		comm,		TASK_COMM_LEN		)
+		__field( pid_t,		pid					)
+		__field( pid_t,		tgid					)
+		__field( pid_t,		ngid					)
+		__array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES))
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		 = task_pid_nr(tsk);
+		__entry->tgid		 = task_tgid_nr(tsk);
+		__entry->ngid		 = task_numa_group_id(tsk);
+		BUILD_BUG_ON(sizeof(nodemask_t) != \
+			     BITS_TO_LONGS(MAX_NUMNODES) * sizeof(long));
+		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
+		       sizeof(__entry->mem_allowed));
+	),
+
+	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
+		  __entry->comm,
+		  __entry->pid,
+		  __entry->tgid,
+		  __entry->ngid,
+		  MAX_NUMNODES, __entry->mem_allowed)
+);
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9903b1b3948..cc892961ce15 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work)
 	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 	 * no page can be migrated.
 	 */
-	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
+		trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
 		return;
+	}
 
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
-- 
2.43.5
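
Illustration (not part of the series): a minimal userspace sketch of the
single-node check added in patch 1/2 and of the range-list form that the
"%*pbl" specifier in patch 2/2 prints. It assumes a 64-bit integer as a
stand-in for nodemask_t. All demo_* names are hypothetical and are not
kernel APIs; the real code uses cpusets_enabled(), nodes_weight() and
cpuset_current_mems_allowed as shown in the diffs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for nodemask_t: bit N set means node N is allowed. */
typedef uint64_t demo_nodemask_t;

/* Models nodes_weight(): count the set bits in the mask. */
static int demo_nodes_weight(demo_nodemask_t mask)
{
	int weight = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		weight++;
	}
	return weight;
}

/*
 * Models the early return in task_numa_work(): with cpusets enabled and
 * exactly one allowed memory node, there is nowhere to migrate pages to,
 * so scanning can be skipped.
 */
static int demo_should_skip_scan(int cpusets_on, demo_nodemask_t mems_allowed)
{
	return cpusets_on && demo_nodes_weight(mems_allowed) == 1;
}

/* Formats the mask as a range list (for example "0-1,3"). */
static void demo_format_node_list(demo_nodemask_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int node = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	while (node < 64) {
		int start, n;

		if (!((mask >> node) & 1)) {
			node++;
			continue;
		}

		/* Extend the run while the next node is also allowed. */
		start = node;
		while (node + 1 < 64 && ((mask >> (node + 1)) & 1))
			node++;

		if (start == node)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, node);

		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)n;
		node++;
	}
}
```

With a mask of 0x4 (only node 2 allowed), demo_should_skip_scan(1, 0x4)
returns nonzero, which corresponds to the early return in the patch. A mask
of 0x3 (nodes 0 and 1) does not trigger the skip, and neither does any mask
when the first argument is 0.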