From: Jiayuan Chen
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org
Cc: Jiayuan Chen, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider, Andrii Nakryiko, Gabriele Monaco, Oleg Nesterov, Ricardo Neri, Libo Chen, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: [RESEND PATCH v1] sched/numa: Add tracepoint to track NUMA migration cost
Date: Wed, 5 Nov 2025 19:42:15 +0800
Message-ID: <20251105114220.150467-1-jiayuan.chen@linux.dev>

From: Jiayuan Chen

In systems with multiple NUMA nodes, memory imbalance between nodes is
common. To address this, we typically tune parameters such as
numa_balancing_scan_size_mb or numa_balancing_scan_period_{min,max}_ms
so that processes migrate pages between NUMA nodes.

Currently, task_numa_work(), which performs this scanning and
migration, holds the mmap_lock for the entire pass, which can
significantly impact process performance, especially for memory
operations.

This patch introduces a pair of tracepoints around task_numa_work() so
that the duration of each pass can be measured, along with the number
of pages scanned and migrated. From these, an efficiency figure similar
to %vmeff in 'sar -B' can be derived, which helps evaluate whether the
adjusted NUMA balancing parameters are properly tuned.

Here's an example bpftrace script:

```bash
bpftrace -e '
tracepoint:sched:sched_numa_balance_start {
	@start_time[cpu] = nsecs;
}
tracepoint:sched:sched_numa_balance_end {
	if (@start_time[cpu] > 0) {
		$cost = nsecs - @start_time[cpu];
		printf("task '%s' migrate cost %lu, scanned %lu, migrated %lu\n",
		       args.comm, $cost, args.scanned, args.migrated);
	}
}
'
```

Sample output:

Attaching 2 probes...
task 'rs:main Q:Reg' migrate cost 5584655, scanned 24516, migrated 22373
task 'systemd-journal' migrate cost 123191, scanned 6308, migrated 0
task 'wrk' migrate cost 894026, scanned 5842, migrated 5841
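As a rough illustration (not part of this patch), a %vmeff-style
efficiency can also be computed directly from the end event. The
one-liner below is only a sketch and assumes the same bpftrace argument
access style as the script above:

```bash
bpftrace -e '
tracepoint:sched:sched_numa_balance_end {
	/* percentage of scanned pages that were reported as migrated */
	if (args.scanned > 0) {
		printf("%s: efficiency %lu%%\n", args.comm,
		       args.migrated * 100 / args.scanned);
	}
}
'
```

A value close to 100% means most of the pages scanned while mmap_lock
was held were also migrated, analogous to how %vmeff relates pages
stolen to pages scanned.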
Signed-off-by: Jiayuan Chen
---
 include/trace/events/sched.h | 60 ++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 14 +++++++--
 2 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..e24bf700a614 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -804,6 +804,66 @@ TRACE_EVENT(sched_skip_cpuset_numa,
 		  __entry->ngid, MAX_NUMNODES, __entry->mem_allowed)
 );
+
+TRACE_EVENT(sched_numa_balance_start,
+
+	TP_PROTO(struct task_struct *tsk),
+
+	TP_ARGS(tsk),
+
+	TP_STRUCT__entry(
+		__array(char, comm, TASK_COMM_LEN)
+		__field(pid_t, pid)
+		__field(pid_t, tgid)
+		__field(pid_t, ngid)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid = task_pid_nr(tsk);
+		__entry->tgid = task_tgid_nr(tsk);
+		__entry->ngid = task_numa_group_id(tsk);
+	),
+
+	TP_printk("comm=%s pid=%d tgid=%d ngid=%d",
+		  __entry->comm,
+		  __entry->pid,
+		  __entry->tgid,
+		  __entry->ngid)
+);
+
+TRACE_EVENT(sched_numa_balance_end,
+
+	TP_PROTO(struct task_struct *tsk, unsigned long scanned, unsigned long migrated),
+
+	TP_ARGS(tsk, scanned, migrated),
+
+	TP_STRUCT__entry(
+		__array(char, comm, TASK_COMM_LEN)
+		__field(pid_t, pid)
+		__field(pid_t, tgid)
+		__field(pid_t, ngid)
+		__field(unsigned long, migrated)
+		__field(unsigned long, scanned)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid = task_pid_nr(tsk);
+		__entry->tgid = task_tgid_nr(tsk);
+		__entry->ngid = task_numa_group_id(tsk);
+		__entry->migrated = migrated;
+		__entry->scanned = scanned;
+	),
+
+	TP_printk("comm=%s pid=%d tgid=%d ngid=%d scanned=%lu migrated=%lu",
+		  __entry->comm,
+		  __entry->pid,
+		  __entry->tgid,
+		  __entry->ngid,
+		  __entry->scanned,
+		  __entry->migrated)
+);
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..173c9c8397e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3294,6 +3294,9 @@ static void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
+	unsigned long nr_scanned = 0;
+	unsigned long total_migrated = 0;
+	unsigned long total_scanned = 0;
 	long pages, virtpages;
 	struct vma_iterator vmi;
 	bool vma_pids_skipped;
@@ -3359,6 +3362,7 @@ static void task_numa_work(struct callback_head *work)
 	if (!mmap_read_trylock(mm))
 		return;
 
+	trace_sched_numa_balance_start(p);
 	/*
 	 * VMAs are skipped if the current PID has not trapped a fault within
 	 * the VMA recently. Allow scanning to be forced if there is no
@@ -3477,6 +3481,10 @@ static void task_numa_work(struct callback_head *work)
 		end = min(end, vma->vm_end);
 		nr_pte_updates = change_prot_numa(vma, start, end);
 
+		nr_scanned = (end - start) >> PAGE_SHIFT;
+		total_migrated += nr_pte_updates;
+		total_scanned += nr_scanned;
+
 		/*
 		 * Try to scan sysctl_numa_balancing_size worth of
 		 * hpages that have at least one present PTE that
@@ -3486,8 +3494,8 @@ static void task_numa_work(struct callback_head *work)
 		 * areas faster.
 		 */
 		if (nr_pte_updates)
-			pages -= (end - start) >> PAGE_SHIFT;
-		virtpages -= (end - start) >> PAGE_SHIFT;
+			pages -= nr_scanned;
+		virtpages -= nr_scanned;
 
 		start = end;
 		if (pages <= 0 || virtpages <= 0)
@@ -3528,6 +3536,8 @@ static void task_numa_work(struct callback_head *work)
 		mm->numa_scan_offset = start;
 	else
 		reset_ptenuma_scan(p);
+
+	trace_sched_numa_balance_end(p, total_scanned, total_migrated);
 	mmap_read_unlock(mm);
 
 	/*
-- 
2.43.0
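Not part of the patch: once a kernel carrying this change is running,
the two events can also be consumed without bpftrace through plain
tracefs; a minimal sketch, assuming tracefs is mounted at the usual
/sys/kernel/tracing:

```bash
# Enable both events and stream the formatted trace records
echo 1 > /sys/kernel/tracing/events/sched/sched_numa_balance_start/enable
echo 1 > /sys/kernel/tracing/events/sched/sched_numa_balance_end/enable
cat /sys/kernel/tracing/trace_pipe
```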