From nobody Mon May 25 08:11:32 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 94EA52D1907 for ; Sat, 16 May 2026 04:25:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905523; cv=none; b=DtAmSoIu1YRjfh4uBgOSvY/pVPOnZCPcwkrKdJbaAbmNJTK9E0My/SBW+1IXyRtcEXdb2ag5Ey6ZP+ewixvXpWjy+sPbHLKVuyz0gReZCGvP5K6biCMcAib/83juzJ31LLiBXYhZZkrVAVCvWHSm59dEynTb6IInXJU1pt2NQEM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905523; c=relaxed/simple; bh=7GhbxdSY7+sfNr9B19nBc/YPJyog7BtsacD5usebgTg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=heqNpnw7M0bMxIm8u7NfN5ISnagrtb2p01ji1bdvaIW0Oj4i05V8/27ooh4pmrkbFZaLc4QlQXC5iJ77pJBmtImQsYTANo4J9Az/xLv3s73W/nWxoK+XBNL2ryZaZ/p74kZ8eVigQnQ7rsGOsOqLA1kRXeAv4ZzcmWsSvqnLDFc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=iBj9dxMB; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="iBj9dxMB" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1778905521; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=3vjyTSbf1KtAbtYJwAX+RB0XaZHDboatGaFSQtnfftE=; b=iBj9dxMBMMrT/OUCDz3flTYUe475ns4vAu1ga8v8XURxpUq1ctDwKq6qNXKddvwttRrwRn KSn9QSIUIP9uIxBrzCrwbGrIYrT/QiC2/b5g3fGI3pruMBbMyMKKacJ61rXOmqEBqO2Mnb i+UZ7YXOuiRkdFf8aSf4YqNAt+P2Xuw= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-650-wo-uFs6SOmmStEBSYOot4g-1; Sat, 16 May 2026 00:25:19 -0400 X-MC-Unique: wo-uFs6SOmmStEBSYOot4g-1 X-Mimecast-MFC-AGG-ID: wo-uFs6SOmmStEBSYOot4g_1778905517 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id AFA641800367; Sat, 16 May 2026 04:25:16 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.2.16.156]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id AE8891803A8E; Sat, 16 May 2026 04:25:11 +0000 (UTC) From: Waiman Long To: Chen Ridong , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , K Prateek Nayak Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Aaron Tomlin , Waiman Long Subject: [PATCH cgroup/for-next v2 1/5] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper Date: Sat, 16 May 2026 00:24:44 -0400 Message-ID: <20260516042448.698216-2-longman@redhat.com> In-Reply-To: <20260516042448.698216-1-longman@redhat.com> References: <20260516042448.698216-1-longman@redhat.com> Precedence: bulk 
X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Extract the DL bandwidth allocation code in cpuset_can_attach() into a new cpuset_reserve_dl_bw() helper to simplify the code. No functional change is expected. Signed-off-by: Waiman Long Reviewed-by: Aaron Tomlin Reviewed-by: Chen Ridong --- kernel/cgroup/cpuset.c | 53 ++++++++++++++++++++++++------------------ 1 file changed, 30 insertions(+), 23 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index bcefc9f50ac5..7cae47829013 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2980,6 +2980,25 @@ static int cpuset_can_attach_check(struct cpuset *cs) return 0; } =20 +static int cpuset_reserve_dl_bw(struct cpuset *cs) +{ + int cpu, ret; + + if (!cs->sum_migrate_dl_bw) + return 0; + + cpu =3D cpumask_any_and(cpu_active_mask, cs->effective_cpus); + if (unlikely(cpu >=3D nr_cpu_ids)) + return -EINVAL; + + ret =3D dl_bw_alloc(cpu, cs->sum_migrate_dl_bw); + if (ret) + return ret; + + cs->dl_bw_cpu =3D cpu; + return 0; +} + static void reset_migrate_dl_data(struct cpuset *cs) { cs->nr_migrate_dl_tasks =3D 0; @@ -2994,7 +3013,7 @@ static int cpuset_can_attach(struct cgroup_taskset *t= set) struct cpuset *cs, *oldcs; struct task_struct *task; bool setsched_check; - int cpu, ret; + int ret; =20 /* used later by cpuset_attach() */ cpuset_attach_old_cs =3D task_cs(cgroup_taskset_first(tset, &css)); @@ -3050,31 +3069,19 @@ static int cpuset_can_attach(struct cgroup_taskset = *tset) } } =20 - if (!cs->sum_migrate_dl_bw) - goto out_success; - - cpu =3D cpumask_any_and(cpu_active_mask, cs->effective_cpus); - if (unlikely(cpu >=3D nr_cpu_ids)) { - ret =3D -EINVAL; - goto out_unlock; - } - - ret =3D dl_bw_alloc(cpu, cs->sum_migrate_dl_bw); - if (ret) - goto out_unlock; - - cs->dl_bw_cpu =3D cpu; - -out_success: - 
/* - * Mark attach is in progress. This makes validate_change() fail - * changes which zero cpus/mems_allowed. - */ - cs->attach_in_progress++; + ret =3D cpuset_reserve_dl_bw(cs); =20 out_unlock: - if (ret) + if (ret) { reset_migrate_dl_data(cs); + } else { + /* + * Mark attach is in progress. This makes validate_change() fail + * changes which zero cpus/mems_allowed. + */ + cs->attach_in_progress++; + } + mutex_unlock(&cpuset_mutex); return ret; } --=20 2.54.0 From nobody Mon May 25 08:11:32 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4FCC029BDBD for ; Sat, 16 May 2026 04:25:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905529; cv=none; b=Q4KwrIjgAqO0tCytp71Gue61v5NBqoBpM/xP5WmCX3qCDzca/4/7nA/gck+OCkNzIeiOQ4SD6X9bj6Nans3+v36vl7+JjyadCpbQlQNO/UFOGSpvm2zzQ6TYabM2YVFutNRNoEjTLB8/8DYelDTnxbhScthNiWYK+JKnM3MSkd8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905529; c=relaxed/simple; bh=uUsIjFaK+LXBH8M5aVRApdAL1NZmQlifJ1IwmdNZ6N4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=a3lOQlnuZctUnIfw/9C29r2JRAsbWMKDqRPRqAxSowfzB/+QzPYZxXP7W/FjriU7bzo7MDotjogpX7VsBy9IOZhtVckJOLz1L316H+QRBMIJcl7lLRfjrCUq1lfweDgpj9CLfVUBBmwUxVKsrQQka5pqo+/lm+UN+eZg001CZ1g= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Op3ouc1Q; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: 
smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Op3ouc1Q" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1778905527; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=wsuL1GmbN3+LE9cyHpY+KKjV64zhgphp8mrpxkwLnYA=; b=Op3ouc1Qsg3fyo8/C3uk0uSC0HBdPVzdqH3ieAeLwyvGJ1olN3HgTGzof3yuur+TDtLgBb fAldNhFWum7IAEG+hRrkFJaqOTuC9M5DovVaqBAP8YeaVEZm4QNQXp1s3fTGofMVRAoToO AMd5aKYoL9/t0TaVq5Chl08X80pW+sc= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-378-8TJRAHe0MQaDVMm1W5jihg-1; Sat, 16 May 2026 00:25:23 -0400 X-MC-Unique: 8TJRAHe0MQaDVMm1W5jihg-1 X-Mimecast-MFC-AGG-ID: 8TJRAHe0MQaDVMm1W5jihg_1778905522 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 921671800367; Sat, 16 May 2026 04:25:21 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.2.16.156]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 29CFD1800576; Sat, 16 May 2026 04:25:16 +0000 (UTC) From: Waiman Long To: Chen Ridong , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben 
Segall , Mel Gorman , Valentin Schneider , K Prateek Nayak Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Aaron Tomlin , Waiman Long Subject: [PATCH cgroup/for-next v2 2/5] cgroup/cpuset: Expand the scope of cpuset_can_attach_check() Date: Sat, 16 May 2026 00:24:45 -0400 Message-ID: <20260516042448.698216-3-longman@redhat.com> In-Reply-To: <20260516042448.698216-1-longman@redhat.com> References: <20260516042448.698216-1-longman@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Expand the scope of cpuset_can_attach_check() so that it also sets the setsched flag, using the new @oldcs and @psetsched arguments. As cpuset_can_attach_check() is also called from cpuset_can_fork(), that caller passes NULL for both new arguments. While at it, expose the results of the source and destination cpuset CPU/memory comparisons in the new attach_cpus_updated and attach_mems_updated static flags so that cpuset_attach() can use them directly instead of repeating the same computations. No functional change is expected. 
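For illustration only, the flag computation described above can be modelled in user space roughly as follows. This is a sketch, not the kernel code: struct model_cpuset, model_can_attach_check() and the v2/v2_mode parameters (standing in for cpuset_v2() and is_in_v2_mode()) are hypothetical names, and CPU/node masks are reduced to plain 64-bit bit sets.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct cpuset; masks are bit sets. */
struct model_cpuset {
	uint64_t effective_cpus;
	uint64_t effective_mems;
	uint64_t mems_allowed;
};

/* Model of the static flags shared between can_attach and attach. */
static bool attach_cpus_updated;
static bool attach_mems_updated;

/*
 * v2 models cpuset_v2() and v2_mode models is_in_v2_mode(); both are
 * assumptions made so that the sketch is self-contained.
 */
static int model_can_attach_check(const struct model_cpuset *cs,
				  const struct model_cpuset *oldcs,
				  bool *psetsched, bool v2, bool v2_mode)
{
	/* The destination must have CPUs (and, in v1, memory nodes). */
	if (!cs->effective_cpus || (!v2_mode && !cs->mems_allowed))
		return -ENOSPC;

	/* Fork path: no source cpuset, nothing more to compute. */
	if (!oldcs)
		return 0;

	assert(psetsched != NULL);
	attach_cpus_updated = cs->effective_cpus != oldcs->effective_cpus;
	attach_mems_updated = cs->effective_mems != oldcs->effective_mems;

	/* In v2 the security check is skipped when nothing changes. */
	*psetsched = !v2 || attach_cpus_updated || attach_mems_updated;

	/* v1 source cpuset emptied by hotplug: let tasks leave unchecked. */
	if (!v2_mode && !oldcs->effective_cpus)
		*psetsched = false;

	return 0;
}
```

A caller on the attach path would pass both cpusets and a pointer to its local setsched flag, while a caller on the fork path would pass NULL for the last two pointer arguments.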
Signed-off-by: Waiman Long Reviewed-by: Chen Ridong --- kernel/cgroup/cpuset.c | 70 +++++++++++++++++++++++++----------------- 1 file changed, 42 insertions(+), 28 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 7cae47829013..0d01b66f464d 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2964,19 +2964,56 @@ static int update_prstate(struct cpuset *cs, int ne= w_prs) return 0; } =20 +/* + * cpuset_can_attach() and cpuset_attach() specific internal data + * Protected by cpuset_mutex + */ static struct cpuset *cpuset_attach_old_cs; +static bool attach_cpus_updated; +static bool attach_mems_updated; =20 /* * Check to see if a cpuset can accept a new task * For v1, cpus_allowed and mems_allowed can't be empty. * For v2, effective_cpus can't be empty. * Note that in v1, effective_cpus =3D cpus_allowed. + * + * Also set the boolean flag passed in by @psetsched depending on if + * security_task_setscheduler() call is needed and @oldcs is not NULL. */ -static int cpuset_can_attach_check(struct cpuset *cs) +static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs, + bool *psetsched) { if (cpumask_empty(cs->effective_cpus) || (!is_in_v2_mode() && nodes_empty(cs->mems_allowed))) return -ENOSPC; + + if (!oldcs) + return 0; + + /* + * Update attach specific data + */ + attach_cpus_updated =3D !cpumask_equal(cs->effective_cpus, oldcs->effecti= ve_cpus); + attach_mems_updated =3D !nodes_equal(cs->effective_mems, oldcs->effective= _mems); + + /* + * Skip rights over task setsched check in v2 when nothing changes, + * migration permission derives from hierarchy ownership in + * cgroup_procs_write_permission()). + */ + *psetsched =3D !cpuset_v2() || attach_cpus_updated || attach_mems_updated; + + /* + * A v1 cpuset with tasks will have no CPU left only when CPU hotplug + * brings the last online CPU offline as users are not allowed to empty + * cpuset.cpus when there are active tasks inside. 
When that happens, + * we should allow tasks to migrate out without security check to make + * sure they will be able to run after migration. + */ + if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus)) + *psetsched =3D false; + return 0; } =20 @@ -3023,29 +3060,10 @@ static int cpuset_can_attach(struct cgroup_taskset = *tset) mutex_lock(&cpuset_mutex); =20 /* Check to see if task is allowed in the cpuset */ - ret =3D cpuset_can_attach_check(cs); + ret =3D cpuset_can_attach_check(cs, oldcs, &setsched_check); if (ret) goto out_unlock; =20 - /* - * Skip rights over task setsched check in v2 when nothing changes, - * migration permission derives from hierarchy ownership in - * cgroup_procs_write_permission()). - */ - setsched_check =3D !cpuset_v2() || - !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) || - !nodes_equal(cs->effective_mems, oldcs->effective_mems); - - /* - * A v1 cpuset with tasks will have no CPU left only when CPU hotplug - * brings the last online CPU offline as users are not allowed to empty - * cpuset.cpus when there are active tasks inside. When that happens, - * we should allow tasks to migrate out without security check to make - * sure they will be able to run after migration. 
- */ - if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus)) - setsched_check =3D false; - cgroup_taskset_for_each(task, css, tset) { ret =3D task_can_attach(task); if (ret) @@ -3140,7 +3158,6 @@ static void cpuset_attach(struct cgroup_taskset *tset) struct cgroup_subsys_state *css; struct cpuset *cs; struct cpuset *oldcs =3D cpuset_attach_old_cs; - bool cpus_updated, mems_updated; bool queue_task_work =3D false; =20 cgroup_taskset_first(tset, &css); @@ -3148,9 +3165,6 @@ static void cpuset_attach(struct cgroup_taskset *tset) =20 lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */ mutex_lock(&cpuset_mutex); - cpus_updated =3D !cpumask_equal(cs->effective_cpus, - oldcs->effective_cpus); - mems_updated =3D !nodes_equal(cs->effective_mems, oldcs->effective_mems); =20 /* * In the default hierarchy, enabling cpuset in the child cgroups @@ -3158,7 +3172,7 @@ static void cpuset_attach(struct cgroup_taskset *tset) * in effective cpus and mems. In that case, we can optimize out * by skipping the task iteration and update. */ - if (cpuset_v2() && !cpus_updated && !mems_updated) { + if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) { cpuset_attach_nodemask_to =3D cs->effective_mems; goto out; } @@ -3175,7 +3189,7 @@ static void cpuset_attach(struct cgroup_taskset *tset) * not set. 
*/ cpuset_attach_nodemask_to =3D cs->effective_mems; - if (!is_memory_migrate(cs) && !mems_updated) + if (!is_memory_migrate(cs) && !attach_mems_updated) goto out; =20 cgroup_taskset_for_each_leader(leader, css, tset) { @@ -3590,7 +3604,7 @@ static int cpuset_can_fork(struct task_struct *task, = struct css_set *cset) mutex_lock(&cpuset_mutex); =20 /* Check to see if task is allowed in the cpuset */ - ret =3D cpuset_can_attach_check(cs); + ret =3D cpuset_can_attach_check(cs, NULL, NULL); if (ret) goto out_unlock; =20 --=20 2.54.0 From nobody Mon May 25 08:11:32 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0A20317AE11 for ; Sat, 16 May 2026 04:25:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905538; cv=none; b=EUDnxfa6swt59PZAxylLBCgEbQUybLfq0rP3DVFCsfppe2XYRt+en27/WeF6nUlqJq0JEPvDBevhxxGLXmdf1B9D7mERemXE+h+hhXR3p/8Sb1vbpCPTjM2jTiX/po8bgkPYvlO3PQ+GASze9dTJUGpjOCXvu1R2T0KSE5rVBvg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905538; c=relaxed/simple; bh=rbGdLUtRh/45TNs0kUnr6cNXG9aMg3sFaODuVTnN5U8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=hw4ZqPw8SyXWkIe7EJmiZ3QfvvBpUsV7VaH/zaOywWZ+h7v/bCRX/STte99tnZtRtjUYfEhs99uS/DWQ9YA4N+swz1Qvvq6O7eLriVwl7XBpyRnhZ+0PZI9dgzFfRptIkVNarn1xEs/D3HrnyF1nWh7Z4J7UA7QR1vBaIDLfjQU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Yu6QOaNS; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass 
(p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Yu6QOaNS" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1778905536; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xZN+A9sq5vOtV3HLuxvOtE/eENwhZf3tS4rk+PEIQlQ=; b=Yu6QOaNSfTJrbCq5jb4+me0ylXMweyvNjijBdlDEe9hAwLedmqu/4hQo27GuGWaafUg4Qr OfnveyLHpgs4L42ytxGX5eVf1Qj14whD/83IddoIdrZgUOwyBBW/sX4IIufI/MX5kzDPy6 82sVanRgg0/w0RanOAyMmOt+hZB3U14= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-587-Dfv19Cw9PxSm5enAObbgCQ-1; Sat, 16 May 2026 00:25:30 -0400 X-MC-Unique: Dfv19Cw9PxSm5enAObbgCQ-1 X-Mimecast-MFC-AGG-ID: Dfv19Cw9PxSm5enAObbgCQ_1778905528 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 30295180056E; Sat, 16 May 2026 04:25:27 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.2.16.156]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 1C0C01800465; Sat, 16 May 2026 04:25:21 +0000 (UTC) From: Waiman Long To: Chen Ridong , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Ingo Molnar , Peter Zijlstra , Juri 
Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , K Prateek Nayak Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Aaron Tomlin , Waiman Long Subject: [PATCH cgroup/for-next v2 3/5] cgroup/cpuset: Replace cpuset_attach_old_cs by a new attach_old_cs field in task_struct Date: Sat, 16 May 2026 00:24:46 -0400 Message-ID: <20260516042448.698216-4-longman@redhat.com> In-Reply-To: <20260516042448.698216-1-longman@redhat.com> References: <20260516042448.698216-1-longman@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" In cpuset_can_attach(), the source (old) cpuset of the tasks is stored in an internal cpuset_attach_old_cs variable to be used later in cpuset_attach(). This is needed because the mapping from a task to its old cpuset is no longer available when cpuset_attach() is called, and the code assumes that there is only one source cpuset. This approach will no longer work once a cgroup_taskset can contain tasks from multiple source cpusets. The easiest way to get the old cpuset information is to temporarily store it in the task_struct itself in cpuset_can_attach() and reuse it in cpuset_attach(). However, that does increase the size of task_struct by 8 bytes on a 64-bit kernel. Add a new attach_old_cs field to task_struct for this purpose and retire the cpuset_attach_old_cs internal variable. Even though attach_old_cs can be counted as a reference to the old cpuset, like cpuset_attach_old_cs, it is used strictly for communication between cpuset_can_attach() and cpuset_attach() within the same task migration session, during which cgroup_mutex is held throughout. So no actual reference counting is performed. 
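As a rough user-space illustration of the per-task stash described above (not the kernel code), the two phases could be modelled as below. All names here (struct model_task, model_can_attach(), model_migrate(), model_attach_count_from()) are hypothetical; the point is only that the old cpuset recorded in the first phase survives the move and stays correct even when tasks come from different source cpusets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cpuset and struct task_struct. */
struct model_cpuset {
	int id;
};

struct model_task {
	struct model_cpuset *cur_cs;        /* cpuset the task belongs to now */
	struct model_cpuset *attach_old_cs; /* valid only between the phases */
};

/* Phase 1 (can_attach): remember each task's source cpuset. */
static void model_can_attach(struct model_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		tasks[i].attach_old_cs = tasks[i].cur_cs;
}

/* The core moves the tasks; the task-to-old-cpuset link is lost here. */
static void model_migrate(struct model_task *tasks, size_t n,
			  struct model_cpuset *dst)
{
	for (size_t i = 0; i < n; i++)
		tasks[i].cur_cs = dst;
}

/* Phase 2 (attach): count the tasks that arrived from a given source. */
static size_t model_attach_count_from(const struct model_task *tasks,
				      size_t n,
				      const struct model_cpuset *src)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		/* Phase 1 must have run for every task in the set. */
		assert(tasks[i].attach_old_cs != NULL);
		if (tasks[i].attach_old_cs == src)
			count++;
	}
	return count;
}
```

A single global variable could answer the phase 2 question only for one source cpuset, which is why the stash is kept per task in this sketch.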
Signed-off-by: Waiman Long --- include/linux/sched.h | 3 +++ kernel/cgroup/cpuset.c | 13 ++++++------- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 004e6d56a499..9b6bb1603592 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -63,6 +63,7 @@ struct bpf_run_ctx; struct bpf_net_context; struct capture_control; struct cfs_rq; +struct cpuset; struct fs_struct; struct futex_pi_state; struct io_context; @@ -1317,6 +1318,8 @@ struct task_struct { /* Sequence number to catch updates: */ seqcount_spinlock_t mems_allowed_seq; int cpuset_mem_spread_rotor; + /* Old cpuset to be used in cpuset_attach() */ + struct cpuset *attach_old_cs; #endif #ifdef CONFIG_CGROUPS /* Control Group info protected by css_set_lock: */ diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 0d01b66f464d..fc632370d07c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2968,7 +2968,6 @@ static int update_prstate(struct cpuset *cs, int new_= prs) * cpuset_can_attach() and cpuset_attach() specific internal data * Protected by cpuset_mutex */ -static struct cpuset *cpuset_attach_old_cs; static bool attach_cpus_updated; static bool attach_mems_updated; =20 @@ -3052,9 +3051,7 @@ static int cpuset_can_attach(struct cgroup_taskset *t= set) bool setsched_check; int ret; =20 - /* used later by cpuset_attach() */ - cpuset_attach_old_cs =3D task_cs(cgroup_taskset_first(tset, &css)); - oldcs =3D cpuset_attach_old_cs; + oldcs =3D task_cs(cgroup_taskset_first(tset, &css)); cs =3D css_cs(css); =20 mutex_lock(&cpuset_mutex); @@ -3075,6 +3072,8 @@ static int cpuset_can_attach(struct cgroup_taskset *t= set) goto out_unlock; } =20 + /* Save a copy of oldcs to be used later in cpuset_attach() */ + task->attach_old_cs =3D oldcs; if (dl_task(task)) { /* * Count all migrating DL tasks for cpuset task accounting. 
@@ -3156,11 +3155,11 @@ static void cpuset_attach(struct cgroup_taskset *ts= et) struct task_struct *task; struct task_struct *leader; struct cgroup_subsys_state *css; - struct cpuset *cs; - struct cpuset *oldcs =3D cpuset_attach_old_cs; + struct cpuset *cs, *oldcs; bool queue_task_work =3D false; =20 - cgroup_taskset_first(tset, &css); + task =3D cgroup_taskset_first(tset, &css); + oldcs =3D task->attach_old_cs; cs =3D css_cs(css); =20 lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */ --=20 2.54.0 From nobody Mon May 25 08:11:32 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 02AC223D7C2 for ; Sat, 16 May 2026 04:26:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905587; cv=none; b=HuJc6bQpdduvfT2D57JrsDHhgXTW64tjYuitxiMzXDMyB/s4VozAaiP7RBjoRMREsoDkloB+5HEYdxDmENKx5I7i3LSM6ZiSWful9mYFF33IbZxGK2gjsmOpy8JH7tJeS254r0E5jhc0728GDgdBDqfH9sCkrL9qQs6MdbDrLX0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778905587; c=relaxed/simple; bh=RYqAPb+UVL2uAvnCHcK1rHa1Syzq493FwavoVozR4Nw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=RLNjJfjG5t1PzKeyxzjaMFIpcyn8j5mvBNvlhhDbJiI2srPRaMBeBTeGBL1mf96l96dZIeSIbOVSDgB2jwnp2OzhCGmJLHf7UMTkjnyoETNAW9etdycwt0ES+yeNq+K2dqdQxVMvcigo+BX+NKm0C5N3Yh2kxRrH3ljH7cPGfho= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Cmv14cs4; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) 
header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Cmv14cs4" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1778905585; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=46zhY4m0IBo27ES4iopox1tGyl+dgQiIRjzfl6S0V+g=; b=Cmv14cs4764fne6M95erBChHJ9toCqj64Lkrjpv/EPWkYwTIuyc9WHeCttUZ0H4/HthAM5 3YOf4+pdXmAUSuSU8CeDujZeuJcv5BxEYaxnjupM2C1c0EWvVSym5a2iSU5OmN3xz600+6 nTCUjnyxE9W/kXjcJizxk0D1RdbTyEE= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-643-7UHSvjMBMsi6VG5_EIAS0w-1; Sat, 16 May 2026 00:25:43 -0400 X-MC-Unique: 7UHSvjMBMsi6VG5_EIAS0w-1 X-Mimecast-MFC-AGG-ID: 7UHSvjMBMsi6VG5_EIAS0w_1778905533 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 9826618002C7; Sat, 16 May 2026 04:25:32 +0000 (UTC) Received: from llong-thinkpadp16vgen1.westford.csb (unknown [10.2.16.156]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 04DA51803A91; Sat, 16 May 2026 04:25:27 +0000 (UTC) From: Waiman Long To: Chen Ridong , Tejun Heo , Johannes Weiner , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot 
, Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , K Prateek Nayak Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Aaron Tomlin , Waiman Long Subject: [PATCH cgroup/for-next v2 4/5] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task() Date: Sat, 16 May 2026 00:24:47 -0400 Message-ID: <20260516042448.698216-5-longman@redhat.com> In-Reply-To: <20260516042448.698216-1-longman@redhat.com> References: <20260516042448.698216-1-longman@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" cpuset_attach_task() was introduced in commit 42a11bf5c543 ("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly") to make the CLONE_INTO_CGROUP flag of clone(2) behave more like moving a task from one cpuset into another. That commit didn't move the mpol_rebind_mm() and cpuset_migrate_mm() calls for the group leader into cpuset_attach_task(). When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new task is its own group leader. So in this case it is still not equivalent to moving a task between cpusets. Make CLONE_INTO_CGROUP behave more like cpuset_attach() by moving the mpol_rebind_mm() and cpuset_migrate_mm() calls inside cpuset_attach_task(). Besides, the original code uses cpuset_attach_nodemask_to both for the nodemask returned by guarantee_online_mems(), which is used only by cpuset_change_task_nodemask(), and for cs->effective_mems in all other cases. Such dual use becomes impractical once the two task iteration loops are merged into one. So keep cpuset_attach_nodemask_to for the nodemask returned by guarantee_online_mems() and reference cs->effective_mems directly in all the other cases. 
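For illustration only, the per-task decision flow that results from merging the two loops can be sketched in user space as below. This is a simplified model, not the kernel code: enum model_action, struct model_task and model_attach_task() are hypothetical names, and the boolean parameters stand in for cpuset_v2(), attach_mems_updated and is_memory_migrate(cs).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the sketch decides to do for one task (hypothetical names). */
enum model_action {
	MODEL_CPUS_ONLY,          /* only the CPU mask is applied */
	MODEL_NODEMASK,           /* task nodemask updated, mm untouched */
	MODEL_REBIND,             /* leader: mempolicy rebound */
	MODEL_REBIND_AND_MIGRATE, /* leader: rebound and pages migrated */
};

struct model_task {
	const struct model_task *group_leader;
	bool has_mm;     /* false models get_task_mm() returning NULL */
	bool has_old_cs; /* models task->attach_old_cs being non-NULL */
};

static enum model_action model_attach_task(const struct model_task *task,
					   bool v2, bool mems_updated,
					   bool memory_migrate)
{
	assert(task != NULL && task->group_leader != NULL);

	/* In v2, unchanged effective_mems means no memory work at all. */
	if (v2 && !mems_updated)
		return MODEL_CPUS_ONLY;

	/* Non-leader threads only get their nodemask updated. */
	if (task != task->group_leader)
		return MODEL_NODEMASK;

	/* A leader without an mm has nothing to rebind. */
	if (!task->has_mm)
		return MODEL_NODEMASK;

	if (task->has_old_cs && memory_migrate)
		return MODEL_REBIND_AND_MIGRATE;

	return MODEL_REBIND;
}
```

Because the leader check lives inside the per-task function in this model, a fork path that calls it for a new thread group leader would take the same rebind branch as a regular migration.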
Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 82 ++++++++++++++++++++++--------------------
 1 file changed, 43 insertions(+), 39 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index fc632370d07c..ab9fcc001f79 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3130,9 +3130,12 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
  */
 static cpumask_var_t cpus_attach;
 static nodemask_t cpuset_attach_nodemask_to;
+static bool queue_task_work;
 
 static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 {
+	struct mm_struct *mm;
+
 	lockdep_assert_cpuset_lock_held();
 
 	if (cs != &top_cpuset)
@@ -3146,17 +3149,48 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 	 */
 	WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
 
+	if (cpuset_v2() && !attach_mems_updated)
+		return;
+
 	cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 	cpuset1_update_task_spread_flags(cs, task);
+
+	if (task != task->group_leader)
+		return;
+
+	/*
+	 * Change mm for threadgroup leader. This is expensive and may
+	 * sleep and should be moved outside migration path proper.
+	 */
+	mm = get_task_mm(task);
+	if (mm) {
+		struct cpuset *oldcs = task->attach_old_cs;
+
+		mpol_rebind_mm(mm, &cs->effective_mems);
+
+		/*
+		 * old_mems_allowed is the same with mems_allowed
+		 * here, except if this task is being moved
+		 * automatically due to hotplug. In that case
+		 * @mems_allowed has been updated and is empty, so
+		 * @old_mems_allowed is the right nodesets that we
+		 * migrate mm from.
+		 */
+		if (oldcs && is_memory_migrate(cs)) {
+			cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+					  &cs->effective_mems);
+			queue_task_work = true;
+		} else {
+			mmput(mm);
+		}
+	}
 }
 
 static void cpuset_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
-	struct task_struct *leader;
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs, *oldcs;
-	bool queue_task_work = false;
 
 	task = cgroup_taskset_first(tset, &css);
 	oldcs = task->attach_old_cs;
@@ -3164,6 +3198,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	mutex_lock(&cpuset_mutex);
+	queue_task_work = false;
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
@@ -3171,53 +3206,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * in effective cpus and mems. In that case, we can optimize out
 	 * by skipping the task iteration and update.
 	 */
-	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) {
-		cpuset_attach_nodemask_to = cs->effective_mems;
+	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated)
 		goto out;
-	}
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
 	cgroup_taskset_for_each(task, css, tset)
 		cpuset_attach_task(cs, task);
 
-	/*
-	 * Change mm for all threadgroup leaders. This is expensive and may
-	 * sleep and should be moved outside migration path proper. Skip it
-	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
-	 * not set.
-	 */
-	cpuset_attach_nodemask_to = cs->effective_mems;
-	if (!is_memory_migrate(cs) && !attach_mems_updated)
-		goto out;
-
-	cgroup_taskset_for_each_leader(leader, css, tset) {
-		struct mm_struct *mm = get_task_mm(leader);
-
-		if (mm) {
-			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
-
-			/*
-			 * old_mems_allowed is the same with mems_allowed
-			 * here, except if this task is being moved
-			 * automatically due to hotplug. In that case
-			 * @mems_allowed has been updated and is empty, so
-			 * @old_mems_allowed is the right nodesets that we
-			 * migrate mm from.
-			 */
-			if (is_memory_migrate(cs)) {
-				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
-						  &cpuset_attach_nodemask_to);
-				queue_task_work = true;
-			} else
-				mmput(mm);
-		}
-	}
-
 out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
-	cs->old_mems_allowed = cpuset_attach_nodemask_to;
+	cs->old_mems_allowed = cs->effective_mems;
 
 	if (cs->nr_migrate_dl_tasks) {
 		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
@@ -3667,7 +3667,11 @@ static void cpuset_fork(struct task_struct *task)
 	/* CLONE_INTO_CGROUP */
 	mutex_lock(&cpuset_mutex);
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	/* Assume CPUs and memory nodes are updated */
+	attach_cpus_updated = attach_mems_updated = true;
+	task->attach_old_cs = task_cs(current);
 	cpuset_attach_task(cs, task);
+	attach_cpus_updated = attach_mems_updated = false;
 
 	dec_attach_in_progress_locked(cs);
 	mutex_unlock(&cpuset_mutex);
-- 
2.54.0

From nobody Mon May 25 08:11:32 2026
From: Waiman Long
To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný, Ingo Molnar,
 Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
 Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
 K Prateek Nayak
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Aaron Tomlin,
 Waiman Long
Subject: [PATCH cgroup/for-next v2 5/5] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
Date: Sat, 16 May 2026 00:24:48 -0400
Message-ID: <20260516042448.698216-6-longman@redhat.com>
In-Reply-To: <20260516042448.698216-1-longman@redhat.com>
References: <20260516042448.698216-1-longman@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

With cgroup v2, the cgroup_taskset structure passed into the cgroup
can_attach() and attach() methods can contain task migration data with
multiple destination or source cpusets when the cpuset controller is
enabled or disabled respectively. Since cpuset is threaded, another
possible way to cause many-to-one migration is to move a whole process
with multiple threads in different cpuset-enabled threaded cgroups into
another cpuset-enabled cgroup. Alternatively, multiple processes from
different cpusets can be written into cgroup.procs as a single operation.

The current cpuset_can_attach() and cpuset_attach() functions still
expect task migration to be from one source cpuset to one destination
cpuset. This has been the case since cpuset was enabled for cgroup v2 in
commit 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default
hierarchy").

This is less of an issue when enabling the cpuset controller, as all the
newly created child cpusets will have exactly the same set of CPUs and
memory nodes as their parent. The exception is when deadline tasks are
involved in the migration, as the deadline task accounting data can be
off. It can be more problematic when the cpuset controller is disabled,
as the CPUs and memory nodes of the child cpusets may differ from those
of their parent, or when a multi-threaded process is moved from different
threaded cgroups.

Fix that by tracking the set of source (old) and destination cpusets in
singly linked lists and iterating them all to properly update the
internal data. Also keep the current cs and oldcs variables up-to-date
with the css and task iterators. cpuset_attach_old_cs is now dropped as
the old cpusets are now being tracked.

To ensure proper DL task accounting, nr_migrate_dl_tasks is decremented
in the source cpuset and incremented in the destination cpuset for each
migrating DL task, and these values are added to nr_deadline_tasks when
the migration is successful.

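
For illustration only, the tracking and accounting scheme described above
can be sketched in user space as follows. This is a minimal sketch under
stated assumptions, not the kernel code: struct toy_cpuset, track(),
note_dl_move() and finish_attach() are hypothetical names, a plain "next"
pointer stands in for the llist node, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpuset; only the fields the sketch needs. */
struct toy_cpuset {
	int nr_deadline_tasks;		/* committed DL task count */
	int nr_migrate_dl_tasks;	/* pending delta for this migration */
	bool in_list;			/* plays the role of attach_node_in_llist */
	struct toy_cpuset *next;	/* plays the role of attach_node */
};

/* Heads of the source and destination lists (assumed single-threaded). */
static struct toy_cpuset *src_head, *dst_head;

/* Add @cs to the list at @head at most once per migration. */
static void track(struct toy_cpuset **head, struct toy_cpuset *cs)
{
	if (cs->in_list)
		return;
	cs->next = *head;
	*head = cs;
	cs->in_list = true;
}

/* Record one DL task moving from @oldcs to @cs (the can_attach phase). */
static void note_dl_move(struct toy_cpuset *cs, struct toy_cpuset *oldcs)
{
	assert(cs && oldcs);
	track(&dst_head, cs);
	track(&src_head, oldcs);
	cs->nr_migrate_dl_tasks++;
	oldcs->nr_migrate_dl_tasks--;
}

/* Walk one list: fold pending deltas in on commit, drop them otherwise. */
static void finish_list(struct toy_cpuset **head, bool commit)
{
	struct toy_cpuset *cs = *head, *next;

	while (cs) {
		next = cs->next;
		if (commit)
			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
		cs->nr_migrate_dl_tasks = 0;
		cs->in_list = false;
		cs->next = NULL;
		cs = next;
	}
	*head = NULL;
}

/* End of the migration: commit on success, roll back on cancel. */
static void finish_attach(bool commit)
{
	finish_list(&src_head, commit);
	finish_list(&dst_head, commit);
}
```

In this sketch, moving two DL tasks from one child and one from another
into the same parent makes three note_dl_move() calls; finish_attach(true)
then applies the negative deltas to the children and the positive delta
to the parent, while finish_attach(false) discards them all.
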
Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset-internal.h |   6 +
 kernel/cgroup/cpuset.c          | 212 +++++++++++++++++++++++---------
 2 files changed, 161 insertions(+), 57 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..4c2772a7fd5e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -161,6 +161,12 @@ struct cpuset {
 	 */
 	bool remote_partition;
 
+	/*
+	 * cpuset_can_attach() and cpuset_attach() specific data
+	 */
+	bool attach_node_in_llist;
+	struct llist_node attach_node;
+
 	/*
 	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
 	 * know when to rebuild associated root domain bandwidth information.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index ab9fcc001f79..1b6eb8cf0bcd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -2968,6 +2969,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
  * cpuset_can_attach() and cpuset_attach() specific internal data
  * Protected by cpuset_mutex
  */
+static LLIST_HEAD(src_cs_head);
+static LLIST_HEAD(dst_cs_head);
 static bool attach_cpus_updated;
 static bool attach_mems_updated;
 
@@ -2980,9 +2983,10 @@ static bool attach_mems_updated;
  * Also set the boolean flag passed in by @psetsched depending on if
  * security_task_setscheduler() call is needed and @oldcs is not NULL.
  */
-static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
-				   bool *psetsched)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs, bool *psetsched)
 {
+	bool cpu_match, mem_match;
+
 	if (cpumask_empty(cs->effective_cpus) ||
 	    (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
 		return -ENOSPC;
@@ -2993,15 +2997,34 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	/*
 	 * Update attach specific data
 	 */
-	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
-	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+	if (!cs->attach_node_in_llist) {
+		llist_add(&cs->attach_node, &dst_cs_head);
+		cs->attach_node_in_llist = true;
+	}
+	if (!oldcs->attach_node_in_llist) {
+		llist_add(&oldcs->attach_node, &src_cs_head);
+		oldcs->attach_node_in_llist = true;
+	}
+
+	cpu_match = cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+	mem_match = nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * Set the updated flags whenever there is a mismatch in any of the
+	 * src/dst pairs.
+	 */
+	if (!attach_cpus_updated)
+		attach_cpus_updated = !cpu_match;
+
+	if (!attach_mems_updated)
+		attach_mems_updated = !mem_match;
 
 	/*
 	 * Skip rights over task setsched check in v2 when nothing changes,
 	 * migration permission derives from hierarchy ownership in
 	 * cgroup_procs_write_permission()).
 	 */
-	*psetsched = !cpuset_v2() || attach_cpus_updated || attach_mems_updated;
+	*psetsched = !cpuset_v2() || !cpu_match || !mem_match;
 
 	/*
 	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
@@ -3016,33 +3039,105 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	return 0;
 }
 
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+/*
+ * If reset_dl_bw is set, reset the previous dl_bw_alloc() call. Otherwise,
+ * update nr_deadline_tasks according to nr_migrate_dl_tasks in both source
+ * and destination cpusets.
+ */
+static void clear_attach_data(bool reset_dl_bw)
 {
+	struct cpuset *cs, *next;
+
+	llist_for_each_entry_safe(cs, next, src_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+	}
+
+	llist_for_each_entry_safe(cs, next, dst_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (reset_dl_bw && cs->dl_bw_cpu >= 0)
+			dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+		cs->sum_migrate_dl_bw = 0;
+		cs->dl_bw_cpu = -1;
+	}
+
+	src_cs_head.first = NULL;
+	dst_cs_head.first = NULL;
+	attach_cpus_updated = false;
+	attach_mems_updated = false;
+}
+
+static int cpuset_reserve_dl_bw(void)
+{
+	struct cpuset *cs;
 	int cpu, ret;
 
-	if (!cs->sum_migrate_dl_bw)
-		return 0;
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+		if (!cs->sum_migrate_dl_bw)
+			continue;
 
-	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
-	if (unlikely(cpu >= nr_cpu_ids))
-		return -EINVAL;
+		cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+		if (unlikely(cpu >= nr_cpu_ids))
+			return -EINVAL;
 
-	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
-	if (ret)
-		return ret;
+		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+		if (ret)
+			return ret;
 
-	cs->dl_bw_cpu = cpu;
+		cs->dl_bw_cpu = cpu;
+	}
 	return 0;
 }
 
-static void reset_migrate_dl_data(struct cpuset *cs)
+static void set_attach_in_progress(void)
+{
+	struct cpuset *cs;
+
+	/*
+	 * Mark attach is in progress. This makes validate_change() fail
+	 * changes which zero cpus/mems_allowed.
+	 */
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		cs->attach_in_progress++;
+}
+
+static void reset_attach_in_progress(void)
 {
-	cs->nr_migrate_dl_tasks = 0;
-	cs->sum_migrate_dl_bw = 0;
-	cs->dl_bw_cpu = -1;
+	struct cpuset *cs;
+
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		dec_attach_in_progress_locked(cs);
 }
 
-/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
+/*
+ * Called by cgroups to determine if a cpuset is usable; cpuset_mutex held.
+ *
+ * With cgroup v2, enabling of cpuset controller in a cgroup subtree can
+ * cause @tset to contain task migration data from one parent cpuset to multiple
+ * child cpusets. Not much is needed to be done here other than tracking the
+ * number of DL tasks in each cpuset as the CPUs and memory nodes of the child
+ * cpusets are exactly the same as the parent.
+ *
+ * Conversely, disabling of cpuset controller can cause @tset to contain task
+ * migration data from multiple child cpusets to one parent cpuset. Here, the
+ * CPUs and memory nodes of the child cpusets may be different from the parent,
+ * but must be a subset of its parent.
+ *
+ * Another possible many-to-one migration is the moving of the whole
+ * multithreaded process with threads in different cpusets to another cpuset.
+ * Alternatively, multiple processes from multiple cpusets can be moved to
+ * another cpuset in a single operation.
+ *
+ * For all other use cases including cgroup v1, @tset task migration data
+ * should be from one source cpuset to one destination cpuset.
+ */
 static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
@@ -3062,6 +3157,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		goto out_unlock;
 
 	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+		struct cpuset *new_oldcs = task_cs(task);
+
+		if ((newcs != cs) || (new_oldcs != oldcs)) {
+			cs = newcs;
+			oldcs = new_oldcs;
+			ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+			if (ret)
+				goto out_unlock;
+		}
 		ret = task_can_attach(task);
 		if (ret)
 			goto out_unlock;
@@ -3081,23 +3186,19 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 			 * contribute to sum_migrate_dl_bw.
 			 */
 			cs->nr_migrate_dl_tasks++;
+			oldcs->nr_migrate_dl_tasks--;
 			if (dl_task_needs_bw_move(task, cs->effective_cpus))
 				cs->sum_migrate_dl_bw += task->dl.dl_bw;
 		}
 	}
 
-	ret = cpuset_reserve_dl_bw(cs);
+	ret = cpuset_reserve_dl_bw();
 
 out_unlock:
-	if (ret) {
-		reset_migrate_dl_data(cs);
-	} else {
-		/*
-		 * Mark attach is in progress. This makes validate_change() fail
-		 * changes which zero cpus/mems_allowed.
-		 */
-		cs->attach_in_progress++;
-	}
+	if (ret)
+		clear_attach_data(true);
+	else
+		set_attach_in_progress();
 
 	mutex_unlock(&cpuset_mutex);
 	return ret;
@@ -3112,14 +3213,8 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 	cs = css_cs(css);
 
 	mutex_lock(&cpuset_mutex);
-	dec_attach_in_progress_locked(cs);
-
-	if (cs->dl_bw_cpu >= 0)
-		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
-	if (cs->nr_migrate_dl_tasks)
-		reset_migrate_dl_data(cs);
-
+	reset_attach_in_progress();
+	clear_attach_data(true);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -3190,43 +3285,46 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
 	struct cgroup_subsys_state *css;
-	struct cpuset *cs, *oldcs;
+	struct cpuset *cs;
 
-	task = cgroup_taskset_first(tset, &css);
-	oldcs = task->attach_old_cs;
+	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
-
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	mutex_lock(&cpuset_mutex);
 	queue_task_work = false;
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
-	 * will trigger a number of cpuset_attach() calls with no change
-	 * in effective cpus and mems. In that case, we can optimize out
-	 * by skipping the task iteration and update.
+	 * will trigger a cpuset_attach() call with no change in effective cpus
+	 * and mems. In that case, we can optimize out by skipping the task
+	 * iteration and update, but the destination cpuset list is iterated to
+	 * set old_mems_allowed.
 	 */
-	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated)
+	if (cpuset_v2() && !attach_cpus_updated && !attach_mems_updated) {
+		llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+			cs->old_mems_allowed = cs->effective_mems;
 		goto out;
+	}
 
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 
-	cgroup_taskset_for_each(task, css, tset)
+	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+
+		if (newcs != cs) {
+			cs->old_mems_allowed = cs->effective_mems;
+			cs = newcs;
+			guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+		}
 		cpuset_attach_task(cs, task);
+	}
 
-out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
 	cs->old_mems_allowed = cs->effective_mems;
-
-	if (cs->nr_migrate_dl_tasks) {
-		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
-		oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
-		reset_migrate_dl_data(cs);
-	}
-
-	dec_attach_in_progress_locked(cs);
-
+out:
+	reset_attach_in_progress();
+	clear_attach_data(false);
 	mutex_unlock(&cpuset_mutex);
 }
 
-- 
2.54.0