From nobody Thu Feb 12 23:17:48 2026
From: Waiman Long
To: Tejun Heo , Zefan Li , Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Xavier , Waiman Long
Subject: [PATCH-cgroup 1/2] cgroup/cpuset: Fix remote root partition creation problem
Date: Wed, 5 Jun 2024 13:18:57 -0400
Message-Id: <20240605171858.1323464-2-longman@redhat.com>
In-Reply-To: <20240605171858.1323464-1-longman@redhat.com>
References: <20240605171858.1323464-1-longman@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Since commit 181c8e091aae ("cgroup/cpuset: Introduce remote partition"),
a remote partition can be created underneath a non-partition root cpuset
as long as its exclusive_cpus are set to distribute exclusive CPUs down
to its children. The generate_sched_domains() function, however, does
not take this new behavior into account and hence will fail to create
the sched domain needed for a remote root (non-isolated) partition.

There are two issues related to remote partition support. First of all,
generate_sched_domains() has a fast path that is activated if
root_load_balance is true and top_cpuset.nr_subparts is non-zero. The
latter condition isn't quite correct for remote partitions, as
nr_subparts only counts the local child partitions underneath it. There
may be no local child partition under top_cpuset even when remote
partitions exist further down the hierarchy. Fix that by checking
subpartitions_cpus instead, which contains the exclusive CPUs allocated
to both local and remote partitions.

Secondly, the valid partition check for subtree skipping in the csa[]
generation loop isn't enough, as a remote partition does not need to
have a partition root as its parent. Fix this problem by breaking the
csa[] array generation loop of generate_sched_domains() into v1- and
v2-specific parts and checking a cpuset's exclusive_cpus before skipping
its subtree in the v2 case.

Also simplify generate_sched_domains() for cgroup v2, as only
non-isolating partition roots should be included in building the cpuset
array and none of the v1 scheduling attributes are supported other than
a different way of creating an isolated partition.

Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 55 ++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f9b97f65e204..fb71d710a603 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -169,7 +169,7 @@ struct cpuset {
 	/* for custom sched domain */
 	int relax_domain_level;
 
-	/* number of valid sub-partitions */
+	/* number of valid local child partitions */
 	int nr_subparts;
 
 	/* partition root state */
@@ -957,13 +957,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
 	bool root_load_balance = is_sched_load_balance(&top_cpuset);
+	bool cgrpv2 = cgroup_subsys_on_dfl(cpuset_cgrp_subsys);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (root_load_balance && !top_cpuset.nr_subparts) {
+	if (root_load_balance && cpumask_empty(subpartitions_cpus)) {
 single_root_domain:
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
@@ -992,16 +993,18 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
+
+		if (cgrpv2)
+			goto v2;
+
 		/*
+		 * v1:
 		 * Continue traversing beyond @cp iff @cp has some CPUs and
 		 * isn't load balancing.  The former is obvious.  The
 		 * latter: All child cpusets contain a subset of the
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
-		 *
-		 * If root is load-balancing, we can skip @cp if it
-		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -1009,16 +1012,28 @@ static int generate_sched_domains(cpumask_var_t **domains,
 			housekeeping_cpumask(HK_TYPE_DOMAIN))))
 			continue;
 
-		if (root_load_balance &&
-		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
-			continue;
-
 		if (is_sched_load_balance(cp) &&
 		    !cpumask_empty(cp->effective_cpus))
 			csa[csn++] = cp;
 
-		/* skip @cp's subtree if not a partition root */
-		if (!is_partition_valid(cp))
+		/* skip @cp's subtree */
+		pos_css = css_rightmost_descendant(pos_css);
+		continue;
+
+v2:
+		/*
+		 * Only valid partition roots that are not isolated and with
+		 * non-empty effective_cpus will be saved into csn[].
+		 */
+		if ((cp->partition_root_state == PRS_ROOT) &&
+		    !cpumask_empty(cp->effective_cpus))
+			csa[csn++] = cp;
+
+		/*
+		 * Skip @cp's subtree if not a partition root and has no
+		 * exclusive CPUs to be granted to child cpusets.
+		 */
+		if (!is_partition_valid(cp) && cpumask_empty(cp->exclusive_cpus))
 			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
@@ -1072,6 +1087,20 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr),
 				      GFP_KERNEL);
 
+	/*
+	 * Cgroup v2 doesn't support domain attributes, just set all of them
+	 * to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
+	 * subset of HK_TYPE_DOMAIN housekeeping CPUs.
+	 */
+	if (cgrpv2) {
+		for (i = 0; i < ndoms; i++) {
+			cpumask_copy(doms[i], csa[i]->effective_cpus);
+			if (dattr)
+				dattr[i] = SD_ATTR_INIT;
+		}
+		goto done;
+	}
+
 	for (nslot = 0, i = 0; i < csn; i++) {
 		struct cpuset *a = csa[i];
 		struct cpumask *dp;
@@ -1231,7 +1260,7 @@ static void rebuild_sched_domains_locked(void)
 	 * root should be only a subset of the active CPUs.  Since a CPU in any
 	 * partition root could be offlined, all must be checked.
 	 */
-	if (top_cpuset.nr_subparts) {
+	if (!cpumask_empty(subpartitions_cpus)) {
 		rcu_read_lock();
 		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
 			if (!is_partition_valid(cs)) {
@@ -4575,7 +4604,7 @@ static void cpuset_handle_hotplug(void)
 	 * In the rare case that hotplug removes all the cpus in
 	 * subpartitions_cpus, we assumed that cpus are updated.
 	 */
-	if (!cpus_updated && top_cpuset.nr_subparts)
+	if (!cpus_updated && !cpumask_empty(subpartitions_cpus))
 		cpus_updated = true;
 
 	/* For v1, synchronize cpus_allowed to cpu_active_mask */
-- 
2.39.3
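
For readers who want to reproduce the situation this first patch fixes, a
remote root partition under a non-partition parent can be set up with a
short cgroup v2 session along the lines of the sketch below. The mount
point, the cgroup names (parent/child) and the CPU numbers are
illustrative assumptions only and are not taken from the patch:

  # Minimal sketch: a remote root partition under a non-partition parent.
  # Assumes cgroup2 is mounted at /sys/fs/cgroup and CPUs 2-3 exist.
  cd /sys/fs/cgroup
  echo +cpuset > cgroup.subtree_control
  mkdir -p parent/child
  echo +cpuset > parent/cgroup.subtree_control

  # Distribute exclusive CPUs downwards without making "parent" a
  # partition root.
  echo 2-3 > parent/cpuset.cpus
  echo 2-3 > parent/cpuset.cpus.exclusive
  echo 2-3 > parent/child/cpuset.cpus
  echo 2-3 > parent/child/cpuset.cpus.exclusive

  # Turn "child" into a remote root partition. Before the fix,
  # generate_sched_domains() could still take the single sched-domain
  # fast path here, because top_cpuset.nr_subparts stays 0 for remote
  # partitions.
  echo root > parent/child/cpuset.cpus.partition
  cat parent/child/cpuset.cpus.partition        # expected: "root"
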
From nobody Thu Feb 12 23:17:48 2026
From: Waiman Long
To: Tejun Heo , Zefan Li , Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Xavier , Waiman Long
Subject: [PATCH-cgroup 2/2] selftest/cgroup: Dump expected sched-domain data to console
Date: Wed, 5 Jun 2024 13:18:58 -0400
Message-Id: <20240605171858.1323464-3-longman@redhat.com>
In-Reply-To: <20240605171858.1323464-1-longman@redhat.com>
References: <20240605171858.1323464-1-longman@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Unlike the list of isolated CPUs, it is not easy to programmatically
determine what sched domains are being created by the scheduler just by
examining the data in the various kernfs filesystems. The easiest way to
get this information is to enable the /sys/kernel/debug/sched/verbose
file so that the information is displayed on the console. This is also
what the test_cpuset_prs.sh script does when the -v flag is given.

It is rather hard to fetch the data from the console and compare it to
the expected result. An easier way is to dump the expected sched-domain
information to the console so that it can be visually compared with the
actual sched-domain data. However, this has to be done manually by
visual inspection and so will only be done once in a while.

Signed-off-by: Waiman Long
---
 .../selftests/cgroup/test_cpuset_prs.sh | 29 +++++++++++++++++--
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index b5eb1be2248c..c5464ee4e17e 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -161,6 +161,14 @@ test_add_proc()
 #  T     = put a task into cgroup
 #  O<c>=<v> = Write to CPU online file of <c>
 #
+# ECPUs    - effective CPUs of cpusets
+# Pstate   - partition root state
+# ISOLCPUS - isolated CPUs ([,])
+#
+# Note that if there are 2 fields in ISOLCPUS, the first one is for
+# sched-debug matching which includes offline CPUs and single-CPU partitions
+# while the second one is for matching cpuset.cpus.isolated.
+#
 SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
 TEST_MATRIX=(
 	# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
@@ -233,10 +241,14 @@ TEST_MATRIX=(
 	                                    A1:P0,A2:P1,A3:P2,B1:P1 2-3"
 	" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4 \
 	                                    A1:P0,A2:P1,A3:P2,B1:P1 2-4,2-3"
+	" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 .  P1 0 A1:0-1,A2:2-3,A3:2-3,B1:4 \
+	                                    A1:P0,A2:P1,A3:P0,B1:P1"
 	" C0-3:S+ C1-3:S+ C3   C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:2,A3:3,B1:4 \
 	                                    A1:P0,A2:P1,A3:P2,B1:P1 2-4,3"
 	" C0-4:S+ C1-4:S+ C2-4 .  X2-4 X2-4:P2 X4:P1   . 0 A1:0-1,A2:2-3,A3:4 \
 	                                    A1:P0,A2:P2,A3:P1 2-4,2-3"
+	" C0-4:S+ C1-4:S+ C2-4 .  X2-4 X2-4:P2 X3-4:P1 . 0 A1:0-1,A2:2,A3:3-4 \
+	                                    A1:P0,A2:P2,A3:P1 2"
 	" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
 	                       . . X5 . . 0 A1:0-4,A2:1-4,A3:2-4 \
 	                                    A1:P0,A2:P-2,A3:P-1"
@@ -556,14 +568,15 @@ check_cgroup_states()
 	do
 		set -- $(echo $CHK | sed -e "s/:/ /g")
 		CGRP=$1
+		CGRP_DIR=$CGRP
 		STATE=$2
 		FILE=
 		EVAL=$(expr substr $STATE 2 2)
-		[[ $CGRP = A2 ]] && CGRP=A1/A2
-		[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+		[[ $CGRP = A2 ]] && CGRP_DIR=A1/A2
+		[[ $CGRP = A3 ]] && CGRP_DIR=A1/A2/A3
 
 		case $STATE in
-			P*) FILE=$CGRP/cpuset.cpus.partition
+			P*) FILE=$CGRP_DIR/cpuset.cpus.partition
 			    ;;
 			*)  echo "Unknown state: $STATE!"
 			    exit 1
@@ -587,6 +600,16 @@ check_cgroup_states()
 				;;
 		esac
 		[[ $EVAL != $VAL ]] && return 1
+
+		#
+		# For root partition, dump sched-domains info to console if
+		# verbose mode set for manual comparison with sched debug info.
+		#
+		[[ $VAL -eq 1 && $VERBOSE -gt 0 ]] && {
+			DOMS=$(cat $CGRP_DIR/cpuset.cpus.effective)
+			[[ -n "$DOMS" ]] &&
+				echo " [$CGRP] sched-domain: $DOMS" > /dev/console
+		}
 	done
 	return 0
 }
-- 
2.39.3
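
The manual comparison described above is expected to look roughly like
the sketch below. It assumes a kernel built with CONFIG_SCHED_DEBUG,
debugfs mounted at /sys/kernel/debug, and a kernel source tree at hand;
the "2-3" value is purely illustrative:

  # Run the cpuset partition selftest in verbose mode. With -v the
  # script enables /sys/kernel/debug/sched/verbose so that sched-domain
  # rebuilds are printed on the console.
  cd tools/testing/selftests/cgroup
  sudo ./test_cpuset_prs.sh -v

  # With this patch applied, every root partition additionally gets a
  # line such as
  #     [A2] sched-domain: 2-3
  # (the content of its cpuset.cpus.effective) written to /dev/console,
  # which can then be compared by eye with the sched-domain data printed
  # by the scheduler.
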