Date: Thu, 17 Feb 2022 09:51:01 +0100
From: Peter Zijlstra
To: Linus Torvalds
Cc: Borislav Petkov, Tadeusz Struk, x86-ml, lkml, zhangqiao22@huawei.com, tj@kernel.org, dietmar.eggemann@arm.com
Subject: [PATCH] sched: Fix yet more sched_fork() races

On Mon, Feb 14, 2022 at 10:16:57AM +0100, Peter Zijlstra wrote:
> Zhang, Tadeusz, TJ, how does this look?

*sigh* I was hoping for some Tested-by, since I've no idea how to
operate this cgroup stuff properly.

Anyway, full patch below. I'll go stick it in sched/urgent.

Reported-by: Linus Torvalds
Tested-by: Dietmar Eggemann
Tested-by: Tadeusz Struk
Tested-by: Zhang Qiao

---
Subject: sched: Fix yet more sched_fork() races
From: Peter Zijlstra
Date: Mon, 14 Feb 2022 10:16:57 +0100

Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
race vs syscalls by not placing the task on the runqueue before it
gets exposed through the pidhash.

Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
trying to fix a single instance of this, instead fix the whole class
of issues, effectively reverting this commit.

Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: Linus Torvalds
Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/sched/task.h |    4 ++--
 kernel/fork.c              |   13 ++++++++++++-
 kernel/sched/core.c        |   34 +++++++++++++++++++++-------------
 3 files changed, 35 insertions(+), 16 deletions(-)

--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -54,8 +54,8 @@ extern asmlinkage void schedule_tail(str
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
-extern void sched_post_fork(struct task_struct *p,
-			    struct kernel_clone_args *kargs);
+extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 void __noreturn do_task_dead(void);
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2266,6 +2266,17 @@ static __latent_entropy struct task_stru
 		goto bad_fork_put_pidfd;
 
 	/*
+	 * Now that the cgroups are pinned, re-clone the parent cgroup and put
+	 * the new task on the correct runqueue. All this *before* the task
+	 * becomes visible.
+	 *
+	 * This isn't part of ->can_fork() because while the re-cloning is
+	 * cgroup specific, it unconditionally needs to place the task on a
+	 * runqueue.
+	 */
+	sched_cgroup_fork(p, args);
+
+	/*
 	 * From this point on we must avoid any synchronous user-space
 	 * communication until we take the tasklist-lock. In particular, we do
 	 * not want user-space to be able to predict the process start-time by
@@ -2375,7 +2386,7 @@ static __latent_entropy struct task_stru
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
-	sched_post_fork(p, args);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1215,9 +1215,8 @@ int tg_nop(struct task_group *tg, void *
 }
 #endif
 
-static void set_load_weight(struct task_struct *p)
+static void set_load_weight(struct task_struct *p, bool update_load)
 {
-	bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
 	int prio = p->static_prio - MAX_RT_PRIO;
 	struct load_weight *load = &p->se.load;
 
@@ -4408,7 +4407,7 @@ int sched_fork(unsigned long clone_flags
 			p->static_prio = NICE_TO_PRIO(0);
 
 		p->prio = p->normal_prio = p->static_prio;
-		set_load_weight(p);
+		set_load_weight(p, false);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -4426,6 +4425,7 @@ int sched_fork(unsigned long clone_flags
 
 	init_entity_runnable_average(&p->se);
 
+
 #ifdef CONFIG_SCHED_INFO
 	if (likely(sched_info_on()))
 		memset(&p->sched_info, 0, sizeof(p->sched_info));
@@ -4441,18 +4441,23 @@ int sched_fork(unsigned long clone_flags
 	return 0;
 }
 
-void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
+void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
 	unsigned long flags;
-#ifdef CONFIG_CGROUP_SCHED
-	struct task_group *tg;
-#endif
 
+	/*
+	 * Because we're not yet on the pid-hash, p->pi_lock isn't strictly
+	 * required yet, but lockdep gets upset if rules are violated.
+	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_CGROUP_SCHED
-	tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
-			  struct task_group, css);
-	p->sched_task_group = autogroup_task_group(p, tg);
+	if (1) {
+		struct task_group *tg;
+		tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
+				  struct task_group, css);
+		tg = autogroup_task_group(p, tg);
+		p->sched_task_group = tg;
+	}
 #endif
 	rseq_migrate(p);
 	/*
@@ -4463,7 +4468,10 @@ void sched_post_fork(struct task_struct
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+}
 
+void sched_post_fork(struct task_struct *p)
+{
 	uclamp_post_fork(p);
 }
 
@@ -6923,7 +6931,7 @@ void set_user_nice(struct task_struct *p
 		put_prev_task(rq, p);
 
 	p->static_prio = NICE_TO_PRIO(nice);
-	set_load_weight(p);
+	set_load_weight(p, true);
 	old_prio = p->prio;
 	p->prio = effective_prio(p);
 
@@ -7214,7 +7222,7 @@ static void __setscheduler_params(struct
 	 */
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
-	set_load_weight(p);
+	set_load_weight(p, true);
 }
 
 /*
@@ -9447,7 +9455,7 @@ void __init sched_init(void)
 #endif
 	}
 
-	set_load_weight(&init_task);
+	set_load_weight(&init_task, false);
 
 	/*
 	 * The boot idle thread does lazy MMU switching as well: