From: Ridong Chen
To: Waiman Long
Cc: cgroups@vger.kernel.org, Tejun Heo, Johannes Weiner,
    ridong.chen@linux.dev, linux-kernel@vger.kernel.org
Subject: [PATCH] cgroup/cpuset: Support multiple source/destination cpusets using pids pattern
Date: Wed, 3 Jun 2026 18:26:03 +0800
Message-ID: <20260603102604.177503-1-ridong.chen@linux.dev>
In-Reply-To: <20260602023203.248077-7-longman@redhat.com>
References: <20260602023203.248077-7-longman@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The current cpuset_can_attach() and cpuset_attach() functions assume
task migration is from one source cpuset to one destination cpuset.
This can be wrong in several scenarios:

 - Moving a multi-threaded process with threads in different cpusets
 - Disabling the cpuset controller (many children to one parent)
 - Enabling the cpuset controller (one parent to many children)

Fix this by adopting the pids subsystem's per-task accounting pattern.
In cpuset_can_attach(), use task_cs(task) to get the correct source
cpuset for each task (like pids_can_attach uses task_css), adjust
nr_deadline_tasks and reserve DL bandwidth per-task, and increment
attach_in_progress per-task on the destination cpuset. In
cpuset_attach(), handle destination cpuset changes within the task
iteration loop.
A shared helper cpuset_undo_attach() reverses the per-task operations
for both partial rollback in cpuset_can_attach() and full reversal in
cpuset_cancel_attach().

When multiple source cpusets are detected in can_attach(), set
attach_many_sources so that cpuset_attach() forces cpus_updated and
mems_updated to true, ensuring all tasks get properly updated
regardless of which source cpuset cpuset_attach_old_cs points to.

This eliminates the need for nr_migrate_dl_tasks, sum_migrate_dl_bw,
and dl_bw_cpu fields in struct cpuset.

Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Ridong Chen
---
 kernel/cgroup/cpuset-internal.h |   8 --
 kernel/cgroup/cpuset.c          | 177 ++++++++++++++++----------------
 2 files changed, 89 insertions(+), 96 deletions(-)

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..601e38b3c75b 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -166,14 +166,6 @@ struct cpuset {
 	 * know when to rebuild associated root domain bandwidth information.
 	 */
 	int nr_deadline_tasks;
-	int nr_migrate_dl_tasks;
-	/* DL bandwidth that needs destination reservation for this attach. */
-	u64 sum_migrate_dl_bw;
-	/*
-	 * CPU used for temporary DL bandwidth allocation during attach;
-	 * -1 if no DL bandwidth was allocated in the current attach.
-	 */
-	int dl_bw_cpu;
 
 	/* Invalid partition error code, not lock protected */
 	enum prs_errcode prs_err;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e52a5a40d607..be222eb6078c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -288,7 +288,6 @@ struct cpuset top_cpuset = {
 	.flags = BIT(CS_CPU_EXCLUSIVE) | BIT(CS_MEM_EXCLUSIVE) |
 		 BIT(CS_SCHED_LOAD_BALANCE),
 	.partition_root_state = PRS_ROOT,
-	.dl_bw_cpu = -1,
 };
 
 /**
@@ -580,8 +579,6 @@ static struct cpuset *dup_or_alloc_cpuset(struct cpuset *cs)
 	if (!trial)
 		return NULL;
 
-	trial->dl_bw_cpu = -1;
-
 	/* Setup cpumask pointer array */
 	cpumask_var_t *pmask[4] = {
 		&trial->cpus_allowed,
@@ -2984,6 +2981,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 static struct cpuset *cpuset_attach_old_cs;
 static bool attach_cpus_updated;
 static bool attach_mems_updated;
+static bool attach_many_sources;
 
 /*
  * Check to see if a cpuset can accept a new task
@@ -3026,30 +3024,36 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	return 0;
 }
 
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+/*
+ * Reverse per-task operations done in cpuset_can_attach().
+ * If @stop_at is non-NULL, only undo tasks before it (partial rollback).
+ * If @stop_at is NULL, undo all tasks (full reversal for cancel_attach).
+ * Must be called with cpuset_mutex held.
+ */
+static void cpuset_undo_attach(struct cgroup_taskset *tset,
+			       struct task_struct *stop_at)
 {
-	int cpu, ret;
-
-	if (!cs->sum_migrate_dl_bw)
-		return 0;
-
-	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
-	if (unlikely(cpu >= nr_cpu_ids))
-		return -EINVAL;
+	struct cgroup_subsys_state *css;
+	struct task_struct *task;
 
-	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
-	if (ret)
-		return ret;
+	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *cs = css_cs(css);
+		struct cpuset *oldcs = task_cs(task);
 
-	cs->dl_bw_cpu = cpu;
-	return 0;
-}
+		if (task == stop_at)
+			break;
 
-static void reset_migrate_dl_data(struct cpuset *cs)
-{
-	cs->nr_migrate_dl_tasks = 0;
-	cs->sum_migrate_dl_bw = 0;
-	cs->dl_bw_cpu = -1;
+		if (dl_task(task)) {
+			cs->nr_deadline_tasks--;
+			oldcs->nr_deadline_tasks++;
+			if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
+				int cpu = cpumask_any_and(cpu_active_mask,
+							  cs->effective_cpus);
+				dl_bw_free(cpu, task->dl.dl_bw);
+			}
+		}
+		dec_attach_in_progress_locked(cs);
+	}
 }
 
 /* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
@@ -3061,96 +3065,79 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	bool setsched_check;
 	int ret;
 
-	/* used later by cpuset_attach() */
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	oldcs = cpuset_attach_old_cs;
 	cs = css_cs(css);
 
 	mutex_lock(&cpuset_mutex);
+	attach_many_sources = false;
 
-	/* Check to see if task is allowed in the cpuset */
 	ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
 	if (ret)
 		goto out_unlock;
 
-	/*
-	 * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get
-	 * the old_mems_allowed value. There are two ways that many-to-one
-	 * cpuset migration can happen:
-	 * 1) A multithread application with threads in different cpusets is
-	 *    wholely migrated to a new cpuset.
-	 * 2) Disabling v2 cpuset controller will move all the tasks in child
-	 *    cpusets to the parent cpuset.
-	 *
-	 * In the former case, it is the mm setting of the group leader that
-	 * really matters. So cpuset_attach_old_cs should track the oldcs of the
-	 * group leader. It falls back to the oldcs of the first task if there
-	 * is no group leader in the taskset. In the latter case, effective_mems
-	 * of child cpusets must always be a subset of the parent. So no real
-	 * page migration will be necessary no matter which child cpuset is
-	 * selected as cpuset_attach_old_cs.
-	 */
 	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+		struct cpuset *new_oldcs = task_cs(task);
+
+		if (newcs != cs || new_oldcs != oldcs) {
+			if (new_oldcs != oldcs)
+				attach_many_sources = true;
+			cs = newcs;
+			oldcs = new_oldcs;
+			ret = cpuset_can_attach_check(cs, oldcs,
+						      &setsched_check);
+			if (ret)
+				goto out_rollback;
+		}
+
 		ret = task_can_attach(task);
 		if (ret)
-			goto out_unlock;
+			goto out_rollback;
 
-		/* Update cpuset_attach_old_cs to the latest group leader */
 		if (task == task->group_leader)
 			cpuset_attach_old_cs = task_cs(task);
 
 		if (setsched_check) {
 			ret = security_task_setscheduler(task);
 			if (ret)
-				goto out_unlock;
+				goto out_rollback;
 		}
 
 		if (dl_task(task)) {
-			/*
-			 * Count all migrating DL tasks for cpuset task accounting.
-			 * Only tasks that need a root-domain bandwidth move
-			 * contribute to sum_migrate_dl_bw.
-			 */
-			cs->nr_migrate_dl_tasks++;
-			if (dl_task_needs_bw_move(task, cs->effective_cpus))
-				cs->sum_migrate_dl_bw += task->dl.dl_bw;
+			cs->nr_deadline_tasks++;
+			oldcs->nr_deadline_tasks--;
+
+			if (dl_task_needs_bw_move(task, cs->effective_cpus)) {
+				int cpu = cpumask_any_and(cpu_active_mask,
+							  cs->effective_cpus);
+				if (unlikely(cpu >= nr_cpu_ids)) {
+					ret = -EINVAL;
+					goto out_rollback;
+				}
+				ret = dl_bw_alloc(cpu, task->dl.dl_bw);
+				if (ret)
+					goto out_rollback;
+			}
 		}
-	}
-
-	ret = cpuset_reserve_dl_bw(cs);
 
-out_unlock:
-	if (ret) {
-		reset_migrate_dl_data(cs);
-	} else {
-		/*
-		 * Mark attach is in progress. This makes validate_change() fail
-		 * changes which zero cpus/mems_allowed.
-		 */
 		cs->attach_in_progress++;
 	}
 
+	goto out_unlock;
+
+out_rollback:
+	cpuset_undo_attach(tset, task);
+
+out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	return ret;
 }
 
 static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 {
-	struct cgroup_subsys_state *css;
-	struct cpuset *cs;
-
-	cgroup_taskset_first(tset, &css);
-	cs = css_cs(css);
-
 	mutex_lock(&cpuset_mutex);
-	dec_attach_in_progress_locked(cs);
-
-	if (cs->dl_bw_cpu >= 0)
-		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
-	if (cs->nr_migrate_dl_tasks)
-		reset_migrate_dl_data(cs);
-
+	cpuset_undo_attach(tset, NULL);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -3232,8 +3219,15 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	mutex_lock(&cpuset_mutex);
 	queue_task_work = false;
 
-	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
-	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+	if (attach_many_sources) {
+		attach_cpus_updated = true;
+		attach_mems_updated = true;
+	} else {
+		attach_cpus_updated = !cpumask_equal(cs->effective_cpus,
+						     oldcs->effective_cpus);
+		attach_mems_updated = !nodes_equal(cs->effective_mems,
+						   oldcs->effective_mems);
+	}
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
@@ -3249,21 +3243,28 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 	}
 
-	cgroup_taskset_for_each(task, css, tset)
+	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+
+		if (newcs != cs) {
+			cs->old_mems_allowed = cpuset_attach_nodemask_to;
+			cs = newcs;
+			if (cpuset_v2())
+				cpuset_attach_nodemask_to = cs->effective_mems;
+			else
+				guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+		}
 		cpuset_attach_task(cs, task);
+	}
 
 out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
-	if (cs->nr_migrate_dl_tasks) {
-		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
-		oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
-		reset_migrate_dl_data(cs);
-	}
-
-	dec_attach_in_progress_locked(cs);
+	/* Decrement per-task attach_in_progress */
+	cgroup_taskset_for_each(task, css, tset)
+		dec_attach_in_progress_locked(css_cs(css));
 
 	mutex_unlock(&cpuset_mutex);
 }
-- 
2.43.0
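
For illustration only, and not part of the patch: a minimal userspace
sketch of the per-task charge and partial-rollback pattern that the
commit message describes. Every toy_* name below is hypothetical, and
the single bw_free budget per group is an assumed stand-in for the
root-domain DL bandwidth accounting; nothing here is kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cpuset fields touched per task. */
struct toy_group {
	int nr_deadline_tasks;	/* plays the role of nr_deadline_tasks */
	int attach_in_progress;	/* plays the role of attach_in_progress */
	long bw_free;		/* assumed remaining DL bandwidth budget */
};

/* Hypothetical task: each one carries its own source and destination. */
struct toy_task {
	struct toy_group *src;
	struct toy_group *dst;
	bool is_dl;
	long dl_bw;
};

/* Reverse the per-task charges for tasks[0] .. tasks[stop_at - 1]. */
static void toy_undo_attach(struct toy_task *tasks, size_t stop_at)
{
	for (size_t i = 0; i < stop_at; i++) {
		struct toy_task *t = &tasks[i];

		if (t->is_dl) {
			t->dst->nr_deadline_tasks--;
			t->src->nr_deadline_tasks++;
			t->dst->bw_free += t->dl_bw;
		}
		assert(t->dst->attach_in_progress > 0);
		t->dst->attach_in_progress--;
	}
}

/*
 * Charge every task against its own source and destination group.
 * Returns 0 on success. On failure, the tasks charged so far are
 * rolled back and -1 is returned.
 */
static int toy_can_attach(struct toy_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct toy_task *t = &tasks[i];

		if (t->is_dl) {
			/* Check before charging: a failing task leaves no state. */
			if (t->dst->bw_free < t->dl_bw) {
				toy_undo_attach(tasks, i);
				return -1;
			}
			t->dst->bw_free -= t->dl_bw;
			t->dst->nr_deadline_tasks++;
			t->src->nr_deadline_tasks--;
		}
		t->dst->attach_in_progress++;
	}
	return 0;
}
```

In this sketch a full cancel corresponds to toy_undo_attach(tasks, n),
and a failure at task i corresponds to toy_undo_attach(tasks, i). The
budget check runs before any counter is modified, so the failing task
itself never needs to be undone.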