From nobody Wed Oct 1 22:33:18 2025 From: Gabriele Monaco To: linux-kernel@vger.kernel.org, Mathieu Desnoyers , Peter Zijlstra , Thomas Gleixner , Ingo Molnar , sched-ext@lists.linux.dev Cc: Gabriele Monaco Subject: [PATCH v3 1/4] sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes Date: Mon, 29 Sep 2025 13:42:22 +0200 Message-ID: <20250929114225.36172-2-gmonaco@redhat.com> In-Reply-To: <20250929114225.36172-1-gmonaco@redhat.com> References: <20250929114225.36172-1-gmonaco@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.0 on
10.30.177.12 Content-Type: text/plain; charset="utf-8" The fair scheduling class relies on prev_sum_exec_runtime to compute the duration of the task's runtime since it was last scheduled. This value is currently not required by other scheduling classes but can be useful to understand long running tasks and take certain actions (e.g. during a scheduler tick). Add support for prev_sum_exec_runtime to the RT, deadline and sched_ext classes by simply assigning the sum_exec_runtime at each set_next_task. Reviewed-by: Mathieu Desnoyers Signed-off-by: Gabriele Monaco --- kernel/sched/deadline.c | 1 + kernel/sched/ext.c | 1 + kernel/sched/rt.c | 1 + 3 files changed, 3 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index e2d51f4306b3..212d6bf5a732 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2342,6 +2342,7 @@ static void set_next_task_dl(struct rq *rq, struct ta= sk_struct *p, bool first) p->se.exec_start =3D rq_clock_task(rq); if (on_dl_rq(&p->dl)) update_stats_wait_end_dl(dl_rq, dl_se); + p->se.prev_sum_exec_runtime =3D p->se.sum_exec_runtime; =20 /* You can't push away the running task */ dequeue_pushable_dl_task(rq, p); diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 7dedc9a16281..7c2d23e6d0df 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3257,6 +3257,7 @@ static void set_next_task_scx(struct rq *rq, struct t= ask_struct *p, bool first) } =20 p->se.exec_start =3D rq_clock_task(rq); + p->se.prev_sum_exec_runtime =3D p->se.sum_exec_runtime; =20 /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED)) diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 7936d4333731..8c713d74672a 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1644,6 +1644,7 @@ static inline void set_next_task_rt(struct rq *rq, st= ruct task_struct *p, bool f p->se.exec_start =3D rq_clock_task(rq); if (on_rt_rq(&p->rt)) 
update_stats_wait_end_rt(rt_rq, rt_se); + p->se.prev_sum_exec_runtime =3D p->se.sum_exec_runtime; =20 /* The running task is never eligible for pushing */ dequeue_pushable_task(rq, p); --=20 2.51.0 From nobody Wed Oct 1 22:33:18 2025 From: Gabriele Monaco To: linux-kernel@vger.kernel.org, Mathieu Desnoyers , Peter Zijlstra , Thomas Gleixner , Ingo Molnar , Andrew Morton , David Hildenbrand , linux-mm@kvack.org Cc: Gabriele Monaco Subject: [PATCH v3 2/4] rseq: Schedule the mm_cid_compaction from rseq_sched_switch_event() Date: Mon, 29 Sep 2025 13:42:23 +0200 Message-ID: <20250929114225.36172-3-gmonaco@redhat.com> In-Reply-To: <20250929114225.36172-1-gmonaco@redhat.com> References:
<20250929114225.36172-1-gmonaco@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12 Content-Type: text/plain; charset="utf-8" Currently, the mm_cid_compaction is triggered by the scheduler tick and runs in a task_work. This makes its behaviour unpredictable for periodic tasks with short runtimes, which may rarely be running when a tick occurs. Schedule the mm_cid_compaction from rseq_sched_switch_event() only if a scan is required, that is, when the pseudo-period of 100ms has elapsed. Keep a tick handler for long-running tasks that are never preempted (i.e. that never call rseq_sched_switch_event()); it triggers a compaction and an mm_cid update only in that case. Signed-off-by: Gabriele Monaco --- include/linux/mm_types.h | 11 +++++++++ include/linux/rseq.h | 3 +++ include/linux/sched.h | 3 +++ kernel/sched/core.c | 48 ++++++++++++++++++++++++++++++++++------ kernel/sched/sched.h | 2 ++ 5 files changed, 60 insertions(+), 7 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 08bc2442db93..5dab88707014 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1424,6 +1424,13 @@ static inline void mm_set_cpus_allowed(struct mm_str= uct *mm, const struct cpumas WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed)); raw_spin_unlock(&mm->cpus_allowed_lock); } + +static inline bool mm_cid_needs_scan(struct mm_struct *mm) +{ + if (!mm) + return false; + return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan)); +} #else /* CONFIG_SCHED_MM_CID */ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p= ) { } static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p= ) { return 0; } @@ -1434,6 +1441,10 @@ static inline unsigned int mm_cid_size(void) return 0; } static inline void mm_set_cpus_allowed(struct mm_struct *mm, const
struct = cpumask *cpumask) { } +static inline bool mm_cid_needs_scan(struct mm_struct *mm) +{ + return false; +} #endif /* CONFIG_SCHED_MM_CID */ =20 struct mmu_gather; diff --git a/include/linux/rseq.h b/include/linux/rseq.h index b8ea95011ec3..12eecde46ff5 100644 --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -4,6 +4,7 @@ =20 #ifdef CONFIG_RSEQ #include +#include =20 void __rseq_handle_slowpath(struct pt_regs *regs); =20 @@ -68,6 +69,8 @@ static __always_inline void rseq_sched_switch_event(struc= t task_struct *t) rseq_raise_notify_resume(t); } } + if (mm_cid_needs_scan(t->mm)) + task_add_mm_cid(t); } =20 /* diff --git a/include/linux/sched.h b/include/linux/sched.h index 857ed17d443b..80c1afb2087d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1407,6 +1407,7 @@ struct task_struct { int last_mm_cid; /* Most recent cid in mm */ int migrate_from_cpu; int mm_cid_active; /* Whether cid bitmap is active */ + unsigned long last_cid_reset; /* Time of last reset in jiffies */ struct callback_head cid_work; #endif =20 @@ -2300,6 +2301,7 @@ void sched_mm_cid_before_execve(struct task_struct *t= ); void sched_mm_cid_after_execve(struct task_struct *t); void sched_mm_cid_fork(struct task_struct *t); void sched_mm_cid_exit_signals(struct task_struct *t); +void task_add_mm_cid(struct task_struct *t); static inline int task_mm_cid(struct task_struct *t) { return t->mm_cid; @@ -2309,6 +2311,7 @@ static inline void sched_mm_cid_before_execve(struct = task_struct *t) { } static inline void sched_mm_cid_after_execve(struct task_struct *t) { } static inline void sched_mm_cid_fork(struct task_struct *t) { } static inline void sched_mm_cid_exit_signals(struct task_struct *t) { } +static inline void task_add_mm_cid(struct task_struct *t) { } static inline int task_mm_cid(struct task_struct *t) { /* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e742a655c9a8..30652bb4a223 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10840,19 
+10840,53 @@ void init_sched_mm_cid(struct task_struct *t) init_task_work(&t->cid_work, task_mm_cid_work); } =20 +void task_add_mm_cid(struct task_struct *t) +{ + struct callback_head *work =3D &t->cid_work; + + if (work->next !=3D work) + return; + /* No page allocation under rq lock */ + task_work_add(t, work, TWA_RESUME); +} + void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { - struct callback_head *work =3D &curr->cid_work; - unsigned long now =3D jiffies; + u64 rtime =3D curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime; =20 + /* + * If a task is running unpreempted for a long time, it won't get its + * mm_cid compacted and won't update its mm_cid value after a + * compaction occurs. + * For such a task, this function does two things: + * A) trigger the mm_cid recompaction, + * B) trigger an update of the task's rseq->mm_cid field at some point + * after recompaction, so it can get a mm_cid value closer to 0. + * A change in the mm_cid triggers an rseq_preempt. + * + * B occurs once after the compaction work completes, neither A nor B + * run as long as the compaction work is pending, the task is exiting + * or is not a userspace task. 
+ */ if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || - work->next !=3D work) + test_tsk_thread_flag(curr, TIF_NOTIFY_RESUME)) return; - if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan))) + if (rtime < RSEQ_UNPREEMPTED_THRESHOLD) return; - - /* No page allocation under rq lock */ - task_work_add(curr, work, TWA_RESUME); + if (mm_cid_needs_scan(curr->mm)) { + /* Trigger mm_cid recompaction */ + task_add_mm_cid(curr); + } else if (time_after(jiffies, curr->last_cid_reset + + msecs_to_jiffies(MM_CID_SCAN_DELAY))) { + /* Update mm_cid field */ + if (!curr->mm_cid_active) + return; + mm_cid_snapshot_time(rq, curr->mm); + mm_cid_put_lazy(curr); + curr->last_mm_cid =3D curr->mm_cid =3D mm_cid_get(rq, curr, curr->mm); + rseq_sched_set_task_mm_cid(curr, curr->mm_cid); + rseq_sched_switch_event(curr); + } } =20 void sched_mm_cid_exit_signals(struct task_struct *t) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 8f14d231e7a7..8c0fb3b0fb35 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3512,6 +3512,7 @@ extern const char *preempt_modes[]; =20 #define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */ #define MM_CID_SCAN_DELAY 100 /* 100ms */ +#define RSEQ_UNPREEMPTED_THRESHOLD SCHED_MM_CID_PERIOD_NS =20 extern raw_spinlock_t cid_lock; extern int use_cid_lock; @@ -3715,6 +3716,7 @@ static inline int mm_cid_get(struct rq *rq, struct ta= sk_struct *t, int cid; =20 lockdep_assert_rq_held(rq); + t->last_cid_reset =3D jiffies; cpumask =3D mm_cidmask(mm); cid =3D __this_cpu_read(pcpu_cid->cid); if (mm_cid_is_valid(cid)) { --=20 2.51.0 From nobody Wed Oct 1 22:33:18 2025 From: Gabriele Monaco To: linux-kernel@vger.kernel.org, Mathieu Desnoyers , Peter Zijlstra , Thomas Gleixner , Ingo Molnar , Andrew Morton , David Hildenbrand , linux-mm@kvack.org Cc: Gabriele Monaco Subject: [PATCH v3 3/4] sched: Compact RSEQ concurrency IDs in batches Date: Mon, 29 Sep 2025 13:42:24 +0200 Message-ID: <20250929114225.36172-4-gmonaco@redhat.com> In-Reply-To: <20250929114225.36172-1-gmonaco@redhat.com> References: <20250929114225.36172-1-gmonaco@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12 Content-Type: text/plain; charset="utf-8" Currently, task_mm_cid_work() is called from resume_user_mode_work(). This can delay the execution of the corresponding thread for the entire duration of the function, negatively affecting the response in case of real time tasks.
In practice, we observe task_mm_cid_work() adding 30-35us of latency on a 128-core system; this order of magnitude is meaningful under PREEMPT_RT. Run task_mm_cid_work() in batches of up to CONFIG_RSEQ_CID_SCAN_BATCH CPUs, which reduces the delay introduced by each scan. task_mm_cid_work() contains a mechanism to avoid running more often than every 100ms. Keep this pseudo-periodicity only for complete scans: each call to task_mm_cid_work() returns early if the period has not elapsed and no scan is in progress (a scan is in progress when the next batch to scan is not the first). This way full scans are not excessively delayed, while each run, and the latency it introduces, stays short. Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_ci= d") Signed-off-by: Gabriele Monaco --- include/linux/mm_types.h | 15 +++++++++++++++ init/Kconfig | 12 ++++++++++++ kernel/sched/core.c | 31 ++++++++++++++++++++++++++++--- 3 files changed, 55 insertions(+), 3 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5dab88707014..83f6dc06b15f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -994,6 +994,13 @@ struct mm_struct { * When the next mm_cid scan is due (in jiffies). */ unsigned long mm_cid_next_scan; + /* + * @mm_cid_scan_batch: Counter for batch used in the next scan. + * + * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH. This field + * increments at each scan and reset when all batches are done. + */ + unsigned int mm_cid_scan_batch; /** * @nr_cpus_allowed: Number of CPUs allowed for mm.
* @@ -1389,6 +1396,7 @@ static inline void mm_init_cid(struct mm_struct *mm, = struct task_struct *p) raw_spin_lock_init(&mm->cpus_allowed_lock); cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask); cpumask_clear(mm_cidmask(mm)); + mm->mm_cid_scan_batch =3D 0; } =20 static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_st= ruct *p) @@ -1427,8 +1435,15 @@ static inline void mm_set_cpus_allowed(struct mm_str= uct *mm, const struct cpumas =20 static inline bool mm_cid_needs_scan(struct mm_struct *mm) { + unsigned int next_batch; + if (!mm) return false; + next_batch =3D READ_ONCE(mm->mm_cid_scan_batch); + /* Always needs scan unless it's the first batch. */ + if (CONFIG_RSEQ_CID_SCAN_BATCH * next_batch < num_possible_cpus() && + next_batch) + return true; return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan)); } #else /* CONFIG_SCHED_MM_CID */ diff --git a/init/Kconfig b/init/Kconfig index 854b35e33318..8905d64c2598 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1918,6 +1918,18 @@ config DEBUG_RSEQ =20 If unsure, say N. =20 +config RSEQ_CID_SCAN_BATCH + int "Number of CPUs to scan at every mm_cid compaction attempt" + range 1 NR_CPUS + default 16 + depends on SCHED_MM_CID + help + CPUs are scanned pseudo-periodically to compact the CID of each task, + this operation can take a longer amount of time on systems with many + CPUs, resulting in higher scheduling latency for the current task. + A higher value means the CID is compacted faster, but results in + higher scheduling latency. 
+ config CACHESTAT_SYSCALL bool "Enable cachestat() system call" if EXPERT default y diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 30652bb4a223..14b79c143d26 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10784,11 +10784,11 @@ static void sched_mm_cid_remote_clear_weight(stru= ct mm_struct *mm, int cpu, =20 static void task_mm_cid_work(struct callback_head *work) { + int weight, cpu, from_cpu, this_batch, next_batch, idx; unsigned long now =3D jiffies, old_scan, next_scan; struct task_struct *t =3D current; struct cpumask *cidmask; struct mm_struct *mm; - int weight, cpu; =20 WARN_ON_ONCE(t !=3D container_of(work, struct task_struct, cid_work)); =20 @@ -10798,6 +10798,17 @@ static void task_mm_cid_work(struct callback_head = *work) mm =3D t->mm; if (!mm) return; + this_batch =3D READ_ONCE(mm->mm_cid_scan_batch); + next_batch =3D this_batch + 1; + from_cpu =3D cpumask_nth(this_batch * CONFIG_RSEQ_CID_SCAN_BATCH, + cpu_possible_mask); + if (from_cpu >=3D nr_cpu_ids) { + from_cpu =3D 0; + next_batch =3D 1; + } + /* Delay scan only if we are done with all cpus. */ + if (from_cpu !=3D 0) + goto cid_compact; old_scan =3D READ_ONCE(mm->mm_cid_next_scan); next_scan =3D now + msecs_to_jiffies(MM_CID_SCAN_DELAY); if (!old_scan) { @@ -10813,17 +10824,31 @@ static void task_mm_cid_work(struct callback_head= *work) return; if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan)) return; + +cid_compact: + if (!try_cmpxchg(&mm->mm_cid_scan_batch, &this_batch, next_batch)) + return; cidmask =3D mm_cidmask(mm); /* Clear cids that were not recently used. */ - for_each_possible_cpu(cpu) + idx =3D 0; + cpu =3D from_cpu; + for_each_cpu_from(cpu, cpu_possible_mask) { + if (idx++ =3D=3D CONFIG_RSEQ_CID_SCAN_BATCH) + break; sched_mm_cid_remote_clear_old(mm, cpu); + } weight =3D cpumask_weight(cidmask); /* * Clear cids that are greater or equal to the cidmask weight to * recompact it. 
*/ - for_each_possible_cpu(cpu) + idx =3D 0; + cpu =3D from_cpu; + for_each_cpu_from(cpu, cpu_possible_mask) { + if (idx++ =3D=3D CONFIG_RSEQ_CID_SCAN_BATCH) + break; sched_mm_cid_remote_clear_weight(mm, cpu, weight); + } } =20 void init_sched_mm_cid(struct task_struct *t) --=20 2.51.0 From nobody Wed Oct 1 22:33:18 2025 From: Gabriele Monaco To: linux-kernel@vger.kernel.org, Mathieu Desnoyers , Peter Zijlstra , Thomas Gleixner , Ingo Molnar , "Paul E.
McKenney" , Shuah Khan , linux-kselftest@vger.kernel.org Cc: Gabriele Monaco , Shuah Khan Subject: [PATCH v3 4/4] selftests/rseq: Add test for mm_cid compaction Date: Mon, 29 Sep 2025 13:42:25 +0200 Message-ID: <20250929114225.36172-5-gmonaco@redhat.com> In-Reply-To: <20250929114225.36172-1-gmonaco@redhat.com> References: <20250929114225.36172-1-gmonaco@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12 Content-Type: text/plain; charset="utf-8" A task in the kernel (task_mm_cid_work) runs somewhat periodically to compact the mm_cid for each process. Add a test to validate that it runs correctly and in a timely manner. The test spawns one thread pinned to each CPU; each thread, including the main one, then runs in short bursts for some time. During this period, the mm_cids should span all values between 0 and nproc. At the end of this phase, a thread with a high enough mm_cid (>=3D nproc/2) is selected as the new leader and all other threads terminate. After some time, the only running thread should see 0 as its mm_cid; if it does not, the compaction mechanism did not work and the test fails. The test never fails if only 1 core is available: in that case nothing can be tested, as the only available mm_cid is 0.
Acked-by: Shuah Khan
Signed-off-by: Gabriele Monaco
---
 tools/testing/selftests/rseq/.gitignore       |   1 +
 tools/testing/selftests/rseq/Makefile         |   2 +-
 .../selftests/rseq/mm_cid_compaction_test.c   | 204 ++++++++++++++++++
 3 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c

diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 0fda241fa62b..b3920c59bf40 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
 basic_percpu_ops_mm_cid_test
 basic_test
 basic_rseq_op_test
+mm_cid_compaction_test
 param_test
 param_test_benchmark
 param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae5954..bc4d940f66d4 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
 		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
-		syscall_errors_test
+		syscall_errors_test mm_cid_compaction_test
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 000000000000..d13623625f5a
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...)			\
+	do {						\
+		if (VERBOSE)				\
+			printf(fmt, ##__VA_ARGS__);	\
+	} while (0)
+
+/* 50 ms */
+#define RUNNER_PERIOD 50000
+/*
+ * Number of runs before we terminate or get the token.
+ * The number is slowly increasing with the number of CPUs as the compaction
+ * process can take longer on larger systems. This is an arbitrary value.
+ */
+#define THREAD_RUNS (3 + args->num_cpus/8)
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+	int cpu;
+	int num_cpus;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	pthread_t *tinfo;
+	struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+	struct thread_args *args = arg;
+	int i, ret, curr_mm_cid;
+	cpu_set_t cpumask;
+
+	CPU_ZERO(&cpumask);
+	CPU_SET(args->cpu, &cpumask);
+	ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to set affinity");
+		abort();
+	}
+	pthread_barrier_wait(args->barrier);
+
+	for (i = 0; i < THREAD_RUNS; i++)
+		usleep(RUNNER_PERIOD);
+	curr_mm_cid = rseq_current_mm_cid();
+	/*
+	 * We select one thread with high enough mm_cid to be the new leader.
+	 * All other threads (including the main thread) will terminate.
+	 * After some time, the mm_cid of the only remaining thread should
+	 * converge to 0, if not, the test fails.
+	 */
+	if (curr_mm_cid >= args->num_cpus / 2 &&
+	    !pthread_mutex_trylock(args->token)) {
+		printf_verbose(
+			"cpu%d has mm_cid=%d and will be the new leader.\n",
+			sched_getcpu(), curr_mm_cid);
+		for (i = 0; i < args->num_cpus; i++) {
+			if (args->tinfo[i] == pthread_self())
+				continue;
+			ret = pthread_join(args->tinfo[i], NULL);
+			if (ret) {
+				errno = ret;
+				perror("Error: failed to join thread");
+				abort();
+			}
+		}
+		pthread_barrier_destroy(args->barrier);
+		free(args->tinfo);
+		free(args->token);
+		free(args->barrier);
+		free(args->args_head);
+
+		for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+			curr_mm_cid = rseq_current_mm_cid();
+			printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+				       curr_mm_cid, sched_getcpu());
+			if (curr_mm_cid == 0)
+				exit(EXIT_SUCCESS);
+			usleep(RUNNER_PERIOD);
+		}
+		exit(EXIT_FAILURE);
+	}
+	printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+		       sched_getcpu(), curr_mm_cid);
+	pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+	cpu_set_t affinity;
+	int i, j, ret = 0, num_threads;
+	pthread_t *tinfo;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	struct thread_args *args;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	num_threads = CPU_COUNT(&affinity);
+	tinfo = calloc(num_threads, sizeof(*tinfo));
+	if (!tinfo) {
+		perror("Error: failed to allocate tinfo");
+		return -1;
+	}
+	args = calloc(num_threads, sizeof(*args));
+	if (!args) {
+		perror("Error: failed to allocate args");
+		ret = -1;
+		goto out_free_tinfo;
+	}
+	token = malloc(sizeof(*token));
+	if (!token) {
+		perror("Error: failed to allocate token");
+		ret = -1;
+		goto out_free_args;
+	}
+	barrier = malloc(sizeof(*barrier));
+	if (!barrier) {
+		perror("Error: failed to allocate barrier");
+		ret = -1;
+		goto out_free_token;
+	}
+	if (num_threads == 1) {
+		fprintf(stderr, "Cannot test on a single cpu. "
+				"Skipping mm_cid_compaction test.\n");
+		/* only skipping the test, this is not a failure */
+		goto out_free_barrier;
+	}
+	pthread_mutex_init(token, NULL);
+	ret = pthread_barrier_init(barrier, NULL, num_threads);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to initialise barrier");
+		goto out_free_barrier;
+	}
+	for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+		if (!CPU_ISSET(i, &affinity))
+			continue;
+		args[j].num_cpus = num_threads;
+		args[j].tinfo = tinfo;
+		args[j].token = token;
+		args[j].barrier = barrier;
+		args[j].cpu = i;
+		args[j].args_head = args;
+		if (!j) {
+			/* The first thread is the main one */
+			tinfo[0] = pthread_self();
+			++j;
+			continue;
+		}
+		ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+		if (ret) {
+			errno = ret;
+			perror("Error: failed to create thread");
+			abort();
+		}
+		++j;
+	}
+	printf_verbose("Started %d threads.\n", num_threads);
+
+	/* Also main thread will terminate if it is not selected as leader */
+	thread_runner(&args[0]);
+
+	/* only reached in case of errors */
+out_free_barrier:
+	free(barrier);
+out_free_token:
+	free(token);
+out_free_args:
+	free(args);
+out_free_tinfo:
+	free(tinfo);
+
+	return ret;
+}
+
+int main(int argc, char **argv)
+{
+	if (!rseq_mm_cid_available()) {
+		fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+		return -1;
+	}
+	if (test_mm_cid_compaction())
+		return -1;
+	return 0;
+}
-- 
2.51.0