From nobody Mon Apr 6 19:57:55 2026
From: Waiman Long
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
    Zefan Li, Johannes Weiner, Will Deacon
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Lai Jiangshan,
    Waiman Long
Subject: [PATCH v7 1/5] sched: Add __releases annotations to affine_move_task()
Date: Fri, 2 Sep 2022 11:25:52 -0400
Message-Id: <20220902152556.373658-2-longman@redhat.com>
In-Reply-To: <20220902152556.373658-1-longman@redhat.com>
References: <20220902152556.373658-1-longman@redhat.com>

affine_move_task() assumes task_rq_lock() has been called and it does
an implicit task_rq_unlock() before returning. Add the appropriate
__releases annotations to make this clear.

A typo in a comment is also fixed.

Signed-off-by: Waiman Long
---
 kernel/sched/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..b351e6d173b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2696,6 +2696,8 @@ void release_user_cpus_ptr(struct task_struct *p)
  */
 static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flags *rf,
                            int dest_cpu, unsigned int flags)
+        __releases(rq->lock)
+        __releases(p->pi_lock)
 {
         struct set_affinity_pending my_pending = { }, *pending = NULL;
         bool stop_pending, complete = false;
@@ -3005,7 +3007,7 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
 
 /*
  * Restrict the CPU affinity of task @p so that it is a subset of
- * task_cpu_possible_mask() and point @p->user_cpu_ptr to a copy of the
+ * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
  * old affinity mask. If the resulting mask is empty, we warn and walk
  * up the cpuset hierarchy until we find a suitable mask.
  */
-- 
2.31.1
From nobody Mon Apr 6 19:57:55 2026
From: Waiman Long
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
    Zefan Li, Johannes Weiner, Will Deacon
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Lai Jiangshan,
    Waiman Long
Subject: [PATCH v7 2/5] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()
Date: Fri, 2 Sep 2022 11:25:53 -0400
Message-Id: <20220902152556.373658-3-longman@redhat.com>
In-Reply-To: <20220902152556.373658-1-longman@redhat.com>
References: <20220902152556.373658-1-longman@redhat.com>

The user_cpus_ptr field was added by commit b90ca8badbd1 ("sched:
Introduce task_struct::user_cpus_ptr to track requested affinity"). It
is currently used only by the arm64 arch due to its possible asymmetric
CPU setup. This patch extends its usage to save the user-provided
cpumask when sched_setaffinity() is called, for all arches. With this
patch applied, user_cpus_ptr, once allocated after a successful call to
sched_setaffinity(), will only be freed when the task exits.

Since user_cpus_ptr is supposed to track "requested affinity", there is
no point in saving the current cpu affinity in
restrict_cpus_allowed_ptr() if sched_setaffinity() has never been
called. Modify the logic to set user_cpus_ptr only in
sched_setaffinity() and to use it in restrict_cpus_allowed_ptr() and
relax_compatible_cpus_allowed_ptr() if defined, without changing it.

This introduces some changes in behavior on arm64 systems with
asymmetric CPUs in some corner cases. For instance, if
sched_setaffinity() has never been called and there is a cpuset change
before relax_compatible_cpus_allowed_ptr() is called, its subsequent
call will follow what the cpuset allows, not what the previous cpu
affinity setting allowed.

Signed-off-by: Waiman Long
---
 kernel/sched/core.c  | 82 ++++++++++++++++++++------------------------
 kernel/sched/sched.h |  7 ++++
 2 files changed, 44 insertions(+), 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b351e6d173b7..c7c0425974c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2850,7 +2850,6 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
         const struct cpumask *cpu_allowed_mask = task_cpu_possible_mask(p);
         const struct cpumask *cpu_valid_mask = cpu_active_mask;
         bool kthread = p->flags & PF_KTHREAD;
-        struct cpumask *user_mask = NULL;
         unsigned int dest_cpu;
         int ret = 0;
 
@@ -2909,14 +2908,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 
         __do_set_cpus_allowed(p, new_mask, flags);
 
-        if (flags & SCA_USER)
-                user_mask = clear_user_cpus_ptr(p);
-
-        ret = affine_move_task(rq, p, rf, dest_cpu, flags);
-
-        kfree(user_mask);
-
-        return ret;
+        return affine_move_task(rq, p, rf, dest_cpu, flags);
 
 out:
         task_rq_unlock(rq, p, rf);
@@ -2951,8 +2943,10 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
  * Change a given task's CPU affinity to the intersection of its current
- * affinity mask and @subset_mask, writing the resulting mask to @new_mask
- * and pointing @p->user_cpus_ptr to a copy of the old mask.
+ * affinity mask and @subset_mask, writing the resulting mask to @new_mask.
+ * If user_cpus_ptr is defined, use it as the basis for restricting CPU
+ * affinity or use cpu_online_mask instead.
+ *
  * If the resulting mask is empty, leave the affinity unchanged and return
  * -EINVAL.
  */
@@ -2960,17 +2954,10 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
                                      struct cpumask *new_mask,
                                      const struct cpumask *subset_mask)
 {
-        struct cpumask *user_mask = NULL;
         struct rq_flags rf;
         struct rq *rq;
         int err;
 
-        if (!p->user_cpus_ptr) {
-                user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
-                if (!user_mask)
-                        return -ENOMEM;
-        }
-
         rq = task_rq_lock(p, &rf);
 
         /*
@@ -2983,25 +2970,15 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
                 goto err_unlock;
         }
 
-        if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
+        if (!cpumask_and(new_mask, task_user_cpus(p), subset_mask)) {
                 err = -EINVAL;
                 goto err_unlock;
         }
 
-        /*
-         * We're about to butcher the task affinity, so keep track of what
-         * the user asked for in case we're able to restore it later on.
-         */
-        if (user_mask) {
-                cpumask_copy(user_mask, p->cpus_ptr);
-                p->user_cpus_ptr = user_mask;
-        }
-
         return __set_cpus_allowed_ptr_locked(p, new_mask, 0, rq, &rf);
 
 err_unlock:
         task_rq_unlock(rq, p, &rf);
-        kfree(user_mask);
         return err;
 }
 
@@ -3055,30 +3032,21 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
- * call to force_compatible_cpus_allowed_ptr(). This will clear (and free)
- * @p->user_cpus_ptr.
+ * call to force_compatible_cpus_allowed_ptr().
  *
  * It is the caller's responsibility to serialise this with any calls to
  * force_compatible_cpus_allowed_ptr(@p).
  */
 void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 {
-        struct cpumask *user_mask = p->user_cpus_ptr;
-        unsigned long flags;
+        int ret;
 
         /*
-         * Try to restore the old affinity mask. If this fails, then
-         * we free the mask explicitly to avoid it being inherited across
-         * a subsequent fork().
+         * Try to restore the old affinity mask with __sched_setaffinity().
+         * Cpuset masking will be done there too.
          */
-        if (!user_mask || !__sched_setaffinity(p, user_mask))
-                return;
-
-        raw_spin_lock_irqsave(&p->pi_lock, flags);
-        user_mask = clear_user_cpus_ptr(p);
-        raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-
-        kfree(user_mask);
+        ret = __sched_setaffinity(p, task_user_cpus(p));
+        WARN_ON_ONCE(ret);
 }
 
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
@@ -8101,7 +8069,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
         if (retval)
                 goto out_free_new_mask;
 again:
-        retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | SCA_USER);
+        retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
         if (retval)
                 goto out_free_new_mask;
 
@@ -8124,6 +8092,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
+        struct cpumask *user_mask;
         struct task_struct *p;
         int retval;
 
@@ -8158,7 +8127,30 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
         if (retval)
                 goto out_put_task;
 
+        user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
+        if (!user_mask) {
+                retval = -ENOMEM;
+                goto out_put_task;
+        }
+        cpumask_copy(user_mask, in_mask);
+
         retval = __sched_setaffinity(p, in_mask);
+
+        /*
+         * Save in_mask into user_cpus_ptr after a successful
+         * __sched_setaffinity() call. pi_lock is used to synchronize
+         * changes to user_cpus_ptr.
+         */
+        if (!retval) {
+                unsigned long flags;
+
+                /* Use pi_lock to synchronize changes to user_cpus_ptr */
+                raw_spin_lock_irqsave(&p->pi_lock, flags);
+                swap(p->user_cpus_ptr, user_mask);
+                raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+        }
+        kfree(user_mask);
+
 out_put_task:
         put_task_struct(p);
         return retval;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..ac235bc8ef08 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1881,6 +1881,13 @@ static inline void dirty_sched_domain_sysctl(int cpu)
 #endif
 
 extern int sched_update_scaling(void);
+
+static inline const struct cpumask *task_user_cpus(struct task_struct *p)
+{
+        if (!p->user_cpus_ptr)
+                return cpu_possible_mask; /* &init_task.cpus_mask */
+        return p->user_cpus_ptr;
+}
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
2.31.1
From nobody Mon Apr 6 19:57:55 2026
From: Waiman Long
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
    Zefan Li, Johannes Weiner, Will Deacon
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Lai Jiangshan,
    Waiman Long
Subject: [PATCH v7 3/5] sched: Enforce user requested affinity
Date: Fri, 2 Sep 2022 11:25:54 -0400
Message-Id: <20220902152556.373658-4-longman@redhat.com>
In-Reply-To: <20220902152556.373658-1-longman@redhat.com>
References: <20220902152556.373658-1-longman@redhat.com>

It was found that the user-requested affinity set via
sched_setaffinity() can be easily overwritten by other kernel
subsystems without an easy way to reset it back to what the user
requested. For example, any change to the current cpuset hierarchy may
reset the cpumask of the tasks in the affected cpusets to the default
cpuset value, even if those tasks have pre-existing user-requested
affinity. That is especially easy to trigger under a cgroup v2
environment, where writing "+cpuset" to the root cgroup's
cgroup.subtree_control file will reset the cpu affinity of all the
processes in the system.

That is problematic in a nohz_full environment, where the tasks running
on the nohz_full CPUs usually have their cpu affinity explicitly set
and will behave incorrectly if that affinity changes.

Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
and using it to restrict the given cpumask unless there is no overlap.
In that case, it will fall back to the given one. The SCA_USER flag is
reused to indicate intent to set user_cpus_ptr, in which case
user_cpus_ptr masking should be skipped. All callers of
set_cpus_allowed_ptr() will be affected by this change.

A scratch cpumask is added to the percpu runqueues structure for doing
additional masking when user_cpus_ptr is set.

Signed-off-by: Waiman Long
---
 kernel/sched/core.c  | 17 ++++++++++++-----
 kernel/sched/sched.h |  3 +++
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7c0425974c2..84544daf3839 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2932,6 +2932,10 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
         struct rq *rq;
 
         rq = task_rq_lock(p, &rf);
+        if (p->user_cpus_ptr && !(flags & SCA_USER) &&
+            cpumask_and(rq->scratch_mask, new_mask, p->user_cpus_ptr))
+                new_mask = rq->scratch_mask;
+
         return __set_cpus_allowed_ptr_locked(p, new_mask, flags, rq, &rf);
 }
 
@@ -3028,7 +3032,7 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
 }
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask);
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
@@ -3045,7 +3049,7 @@ void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
          * Try to restore the old affinity mask with __sched_setaffinity().
          * Cpuset masking will be done there too.
          */
-        ret = __sched_setaffinity(p, task_user_cpus(p));
+        ret = __sched_setaffinity(p, task_user_cpus(p), 0);
         WARN_ON_ONCE(ret);
 }
 
@@ -8049,7 +8053,7 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
 #endif
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags)
 {
         int retval;
         cpumask_var_t cpus_allowed, new_mask;
@@ -8069,7 +8073,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
         if (retval)
                 goto out_free_new_mask;
 again:
-        retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK);
+        retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | flags);
         if (retval)
                 goto out_free_new_mask;
 
@@ -8134,7 +8138,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
         }
         cpumask_copy(user_mask, in_mask);
 
-        retval = __sched_setaffinity(p, in_mask);
+        retval = __sched_setaffinity(p, in_mask, SCA_USER);
 
         /*
          * Save in_mask into user_cpus_ptr after a successful
@@ -9647,6 +9651,9 @@ void __init sched_init(void)
                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
                 per_cpu(select_rq_mask, i) = (cpumask_var_t)kzalloc_node(
                         cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+                per_cpu(runqueues.scratch_mask, i) =
+                        (cpumask_var_t)kzalloc_node(cpumask_size(),
+                                GFP_KERNEL, cpu_to_node(i));
         }
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac235bc8ef08..482b702d65ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1159,6 +1159,9 @@ struct rq {
         unsigned int            core_forceidle_occupation;
         u64                     core_forceidle_start;
 #endif
+
+        /* Scratch cpumask to be temporarily used under rq_lock */
+        cpumask_var_t           scratch_mask;
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.31.1
From nobody Mon Apr 6 19:57:55 2026
From: Waiman Long
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
    Zefan Li, Johannes Weiner, Will Deacon
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Lai Jiangshan,
    Waiman Long
Subject: [PATCH v7 4/5] sched: Handle set_cpus_allowed_ptr() & sched_setaffinity() race
Date: Fri, 2 Sep 2022 11:25:55 -0400
Message-Id: <20220902152556.373658-5-longman@redhat.com>
In-Reply-To: <20220902152556.373658-1-longman@redhat.com>
References: <20220902152556.373658-1-longman@redhat.com>

Racing is possible between set_cpus_allowed_ptr() and
sched_setaffinity(), or between multiple sched_setaffinity() calls from
different CPUs. To resolve these race conditions, we need to update
both user_cpus_ptr and cpus_mask in a single lock critical section
instead of two separate ones. This requires moving the user_cpus_ptr
update to set_cpus_allowed_common().

The SCA_USER flag will be used to signify that a user_cpus_ptr update
has to be done. The new user_cpus_ptr will be put into a percpu
variable, pending_user_mask, at the beginning of the lock critical
section. The pending user mask will then be picked up in
set_cpus_allowed_common().

Ideally, user_cpus_ptr should only be updated if sched_setaffinity()
is successful. However, this patch updates user_cpus_ptr when the first
call to __set_cpus_allowed_ptr() is successful. If there is racing
between sched_setaffinity() and a cpuset update, subsequent calls to
__set_cpus_allowed_ptr() may fail even though user_cpus_ptr has already
been updated. A warning will be printed in this corner case.
Signed-off-by: Waiman Long
---
 kernel/sched/core.c | 59 ++++++++++++++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84544daf3839..618341d0fa51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -111,6 +111,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU(struct cpumask **, pending_user_mask);
 
 #ifdef CONFIG_SCHED_DEBUG
 /*
@@ -2199,6 +2200,7 @@ __do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask, u32
 
 static int __set_cpus_allowed_ptr(struct task_struct *p,
                                   const struct cpumask *new_mask,
+                                  struct cpumask **puser_mask,
                                   u32 flags);
 
 static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
@@ -2249,7 +2251,8 @@ void migrate_enable(void)
          */
         preempt_disable();
         if (p->cpus_ptr != &p->cpus_mask)
-                __set_cpus_allowed_ptr(p, &p->cpus_mask, SCA_MIGRATE_ENABLE);
+                __set_cpus_allowed_ptr(p, &p->cpus_mask, NULL,
+                                       SCA_MIGRATE_ENABLE);
         /*
          * Mustn't clear migration_disabled() until cpus_ptr points back at the
          * regular cpus_mask, otherwise things that race (eg.
@@ -2538,6 +2541,12 @@ void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_ma
 
         cpumask_copy(&p->cpus_mask, new_mask);
         p->nr_cpus_allowed = cpumask_weight(new_mask);
+
+        /*
+         * Swap in the new user_cpus_ptr if SCA_USER flag set
+         */
+        if (flags & SCA_USER)
+                swap(p->user_cpus_ptr, *__this_cpu_read(pending_user_mask));
 }
 
 static void
@@ -2926,12 +2935,19 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
  * call is not atomic; no spinlocks may be held.
  */
 static int __set_cpus_allowed_ptr(struct task_struct *p,
-                                  const struct cpumask *new_mask, u32 flags)
+                                  const struct cpumask *new_mask,
+                                  struct cpumask **puser_mask,
+                                  u32 flags)
 {
         struct rq_flags rf;
         struct rq *rq;
 
         rq = task_rq_lock(p, &rf);
+        /*
+         * CPU won't be preempted or interrupted while holding task_rq_lock().
+         */
+        __this_cpu_write(pending_user_mask, puser_mask);
+
         if (p->user_cpus_ptr && !(flags & SCA_USER) &&
             cpumask_and(rq->scratch_mask, new_mask, p->user_cpus_ptr))
                 new_mask = rq->scratch_mask;
@@ -2941,7 +2957,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 
 int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 {
-        return __set_cpus_allowed_ptr(p, new_mask, 0);
+        return __set_cpus_allowed_ptr(p, new_mask, NULL, 0);
 }
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
@@ -3032,7 +3048,8 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
 }
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags);
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask,
+                    struct cpumask **puser_mask, int flags);
 
 /*
  * Restore the affinity of a task @p which was previously restricted by a
@@ -3049,7 +3066,7 @@ void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
          * Try to restore the old affinity mask with __sched_setaffinity().
          * Cpuset masking will be done there too.
          */
-        ret = __sched_setaffinity(p, task_user_cpus(p), 0);
+        ret = __sched_setaffinity(p, task_user_cpus(p), NULL, 0);
         WARN_ON_ONCE(ret);
 }
 
@@ -3529,6 +3546,7 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
                                          const struct cpumask *new_mask,
+                                         struct cpumask *user_mask,
                                          u32 flags)
 {
         return set_cpus_allowed_ptr(p, new_mask);
@@ -8053,7 +8071,8 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask)
 #endif
 
 static int
-__sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags)
+__sched_setaffinity(struct task_struct *p, const struct cpumask *mask,
+                    struct cpumask **puser_mask, int flags)
 {
         int retval;
         cpumask_var_t cpus_allowed, new_mask;
@@ -8072,8 +8091,10 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags
         retval = dl_task_check_affinity(p, new_mask);
         if (retval)
                 goto out_free_new_mask;
+
+        retval = __set_cpus_allowed_ptr(p, new_mask, puser_mask,
+                                        SCA_CHECK | flags);
 again:
-        retval = __set_cpus_allowed_ptr(p, new_mask, SCA_CHECK | flags);
         if (retval)
                 goto out_free_new_mask;
 
@@ -8084,6 +8105,14 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask, int flags
          * Just reset the cpumask to the cpuset's cpus_allowed.
          */
         cpumask_copy(new_mask, cpus_allowed);
+        retval = __set_cpus_allowed_ptr(p, new_mask, NULL, SCA_CHECK);
+
+        /*
+         * Warn if user_cpus_ptr was updated by the first
+         * __set_cpus_allowed_ptr() call but a subsequent retry fails.
+         */
+        WARN_ON_ONCE(retval && (flags & SCA_USER));
         goto again;
 }
 
@@ -8138,21 +8167,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
         }
         cpumask_copy(user_mask, in_mask);
 
-        retval = __sched_setaffinity(p, in_mask, SCA_USER);
-
-        /*
-         * Save in_mask into user_cpus_ptr after a successful
-         * __sched_setaffinity() call. pi_lock is used to synchronize
-         * changes to user_cpus_ptr.
-         */
-        if (!retval) {
-                unsigned long flags;
-
-                /* Use pi_lock to synchronize changes to user_cpus_ptr */
-                raw_spin_lock_irqsave(&p->pi_lock, flags);
-                swap(p->user_cpus_ptr, user_mask);
-                raw_spin_unlock_irqrestore(&p->pi_lock, flags);
-        }
+        retval = __sched_setaffinity(p, in_mask, &user_mask, SCA_USER);
         kfree(user_mask);
 
 out_put_task:
-- 
2.31.1
From nobody Mon Apr 6 19:57:55 2026
From: Waiman Long
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tejun Heo,
    Zefan Li, Johannes Weiner, Will Deacon
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Lai Jiangshan,
    Waiman Long
Subject: [PATCH v7 5/5] sched: Fix sched_setaffinity() and fork/clone() race
Date: Fri, 2 Sep 2022 11:25:56 -0400
Message-Id: <20220902152556.373658-6-longman@redhat.com>
In-Reply-To: <20220902152556.373658-1-longman@redhat.com>
References: <20220902152556.373658-1-longman@redhat.com>

sched_setaffinity() can also race with a concurrent fork/clone()
syscall calling dup_user_cpus_ptr(). That may lead to a use-after-free
problem. Fix that by protecting the cpumask copy with the pi_lock of
the source task.

Signed-off-by: Waiman Long
---
 kernel/sched/core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 618341d0fa51..7157c9a3f31e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2602,6 +2602,8 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
                       int node)
 {
+        unsigned long flags;
+
         if (!src->user_cpus_ptr)
                 return 0;
 
@@ -2609,7 +2611,10 @@ int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
         if (!dst->user_cpus_ptr)
                 return -ENOMEM;
 
+        /* Use pi_lock to protect content of user_cpus_ptr */
+        raw_spin_lock_irqsave(&src->pi_lock, flags);
         cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+        raw_spin_unlock_irqrestore(&src->pi_lock, flags);
         return 0;
 }
 
-- 
2.31.1