From nobody Wed Dec 17 07:08:04 2025
From: Libo Chen
To: peterz@infradead.org, mgorman@techsingularity.net, longman@redhat.com
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, tj@kernel.org
Subject: [PATCH v2 1/2] sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Date: Wed, 26 Mar 2025 17:23:51 -0700
Message-ID: <20250327002352.203332-2-libo.chen@oracle.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20250327002352.203332-1-libo.chen@oracle.com>
References: <20250327002352.203332-1-libo.chen@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing with the rest of VMA scanning and hinting
page faults, as they are pure overhead. With this change, there are no
more unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical Java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AArch64 system. With the same pinning, on an 18-cores-per-socket Intel
platform, we have seen a 20% improvement in a microbenchmark that creates
a 30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Signed-off-by: Libo Chen
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e5807..6f405e00c9c7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	/*
+	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
+	 * no page can be migrated.
+	 */
+	if (nodes_weight(cpuset_current_mems_allowed) == 1)
+		return;
+
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-- 
2.43.5

From nobody Wed Dec 17 07:08:04 2025
From: Libo Chen
To: peterz@infradead.org, mgorman@techsingularity.net, longman@redhat.com
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, tj@kernel.org
Subject: [PATCH v2 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Date: Wed, 26 Mar 2025 17:23:52 -0700
Message-ID: <20250327002352.203332-3-libo.chen@oracle.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20250327002352.203332-1-libo.chen@oracle.com>
References: <20250327002352.203332-1-libo.chen@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Unlike the sched_skip_vma_numa tracepoint, which tracks skipped VMAs, this
tracepoint tracks the task subject to cpuset.mems pinning and prints its
allowed memory node mask.
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  4 +++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bfd97cce40a1a..133d9a671734a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -745,6 +745,37 @@ TRACE_EVENT(sched_skip_vma_numa,
 		  __entry->vm_end,
 		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
 );
+
+TRACE_EVENT(sched_skip_cpuset_numa,
+
+	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
+
+	TP_ARGS(tsk, mem_allowed_ptr),
+
+	TP_STRUCT__entry(
+		__array( char, comm, TASK_COMM_LEN )
+		__field( pid_t, pid )
+		__field( pid_t, tgid )
+		__field( pid_t, ngid )
+		__array( unsigned long, mem_allowed, BITS_TO_LONGS(MAX_NUMNODES))
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid = task_pid_nr(tsk);
+		__entry->tgid = task_tgid_nr(tsk);
+		__entry->ngid = task_numa_group_id(tsk);
+		memcpy(__entry->mem_allowed, mem_allowed_ptr->bits,
+		       sizeof(__entry->mem_allowed));
+	),
+
+	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_node_allowed_mask=%lx",
+		  __entry->comm,
+		  __entry->pid,
+		  __entry->tgid,
+		  __entry->ngid,
+		  __entry->mem_allowed[0])
+);
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f405e00c9c7e..a98842a96eda0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work)
 	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 	 * no page can be migrated.
 	 */
-	if (nodes_weight(cpuset_current_mems_allowed) == 1)
+	if (nodes_weight(cpuset_current_mems_allowed) == 1) {
+		trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
 		return;
+	}
 
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
-- 
2.43.5
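
For readers who want to experiment with the check outside the kernel, the
following is a minimal userspace sketch of the condition this series adds to
task_numa_work(): skip NUMA scanning when exactly one memory node is allowed.
It is an illustration under stated assumptions, not the kernel implementation.
The toy_* names, the 128-node limit, and the plain bit-array mask are
hypothetical stand-ins for nodemask_t, MAX_NUMNODES, and nodes_weight().

```c
/*
 * Userspace sketch only. All toy_* identifiers are hypothetical; the
 * kernel uses nodemask_t, MAX_NUMNODES and nodes_weight() instead.
 */
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NUMNODES	128	/* assumed limit, for illustration */
#define TOY_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITS_TO_LONGS(n)	(((n) + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* One bit per NUMA node, packed into an array of unsigned long. */
struct toy_nodemask {
	unsigned long bits[TOY_BITS_TO_LONGS(TOY_MAX_NUMNODES)];
};

/* Mark a node as allowed; out-of-range node numbers are ignored. */
static void toy_node_set(struct toy_nodemask *m, unsigned int node)
{
	if (node >= TOY_MAX_NUMNODES)
		return;
	m->bits[node / TOY_BITS_PER_LONG] |= 1UL << (node % TOY_BITS_PER_LONG);
}

/* Count the set bits across every word of the mask. */
static unsigned int toy_nodes_weight(const struct toy_nodemask *m)
{
	unsigned int weight = 0;

	for (size_t i = 0; i < TOY_BITS_TO_LONGS(TOY_MAX_NUMNODES); i++) {
		unsigned long w = m->bits[i];

		/* Clear the lowest set bit until the word is empty. */
		while (w) {
			w &= w - 1;
			weight++;
		}
	}
	return weight;
}

/* Mirrors the added early return: true when exactly one node is allowed. */
static bool toy_skip_numa_scan(const struct toy_nodemask *allowed)
{
	return toy_nodes_weight(allowed) == 1;
}
```

With a zero-initialized mask, toy_skip_numa_scan() returns false. After
toy_node_set(&mask, 3) it returns true, and setting a second node makes it
return false again, so scanning would proceed in that case.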