From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Waiman Long
Subject: [PATCH 1/2] cgroup/cpuset: Use cpuset_rwsem read lock in cpuset_can_attach()
Date: Thu, 8 Dec 2022 14:56:33 -0500
Message-Id: <20221208195634.2604362-2-longman@redhat.com>
In-Reply-To: <20221208195634.2604362-1-longman@redhat.com>
References: <20221208195634.2604362-1-longman@redhat.com>

Since commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to
percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem. This has
the undesirable side effect of increasing the latency of taking the
cpuset_rwsem write lock, potentially by up to an RCU grace period, which
can be especially problematic on systems with a large number of CPU
cores.

One particular pain point is moving a task from one cpuset to another.
The locking sequence for such a task migration is as follows:

  cgroup_mutex => cgroup_threadgroup_rwsem => cpuset_rwsem

The cpuset_rwsem write lock has to be taken twice - in both
cpuset_can_attach() and cpuset_attach(). This can significantly delay
the fork/exit path, which is blocked while the task migration is in
progress.

Reduce that latency by taking the cpuset_rwsem read lock in
cpuset_can_attach() and cpuset_cancel_attach(), as they don't need to
change anything in the cpuset except attach_in_progress, which is now
changed to an atomic_t type to allow proper concurrent updates while
holding the cpuset_rwsem read lock. The attach_in_progress field is only
read while holding the cpuset_rwsem write lock, avoiding possible race
conditions.
Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b474289c15b8..800c65de5daa 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -171,7 +171,7 @@ struct cpuset {
 	 * Tasks are being attached to this cpuset. Used to prevent
 	 * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
 	 */
-	int attach_in_progress;
+	atomic_t attach_in_progress;
 
 	/* partition number for rebuild_sched_domains() */
 	int pn;
@@ -369,11 +369,14 @@ static struct cpuset top_cpuset = {
  * There are two global locks guarding cpuset structures - cpuset_rwsem and
  * callback_lock. We also require taking task_lock() when dereferencing a
  * task's cpuset pointer. See "The task_lock() exception", at the end of this
- * comment. The cpuset code uses only cpuset_rwsem write lock. Other
- * kernel subsystems can use cpuset_read_lock()/cpuset_read_unlock() to
- * prevent change to cpuset structures.
- *
- * A task must hold both locks to modify cpusets. If a task holds
+ * comment. The cpuset code uses mostly cpuset_rwsem write lock with the
+ * exception of cpuset_can_attach() and cpuset_cancel_attach(). Other kernel
+ * subsystems can use cpuset_read_lock()/cpuset_read_unlock() to prevent
+ * change to cpuset structures.
+ *
+ * A task must hold both locks to modify cpusets except attach_in_progress
+ * which can be modified by either holding cpuset_rwsem read or write
+ * lock and to be read under cpuset_rwsem write lock. If a task holds
  * cpuset_rwsem, it blocks others wanting that rwsem, ensuring that it
  * is the only task able to also acquire callback_lock and be able to
  * modify cpusets. It can perform various checks on the cpuset structure
@@ -746,7 +749,8 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	 * be changed to have empty cpus_allowed or mems_allowed.
 	 */
 	ret = -ENOSPC;
-	if ((cgroup_is_populated(cur->css.cgroup) || cur->attach_in_progress)) {
+	if ((cgroup_is_populated(cur->css.cgroup) ||
+	     atomic_read(&cur->attach_in_progress))) {
 		if (!cpumask_empty(cur->cpus_allowed) &&
 		    cpumask_empty(trial->cpus_allowed))
 			goto out;
@@ -2448,7 +2452,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	cs = css_cs(css);
 
-	percpu_down_write(&cpuset_rwsem);
+	percpu_down_read(&cpuset_rwsem);
 
 	/* allow moving tasks into an empty cpuset if on default hierarchy */
 	ret = -ENOSPC;
@@ -2475,10 +2479,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	 * Mark attach is in progress. This makes validate_change() fail
 	 * changes which zero cpus/mems_allowed.
 	 */
-	cs->attach_in_progress++;
+	atomic_inc(&cs->attach_in_progress);
 	ret = 0;
 out_unlock:
-	percpu_up_write(&cpuset_rwsem);
+	percpu_up_read(&cpuset_rwsem);
 	return ret;
 }
 
@@ -2488,9 +2492,9 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_first(tset, &css);
 
-	percpu_down_write(&cpuset_rwsem);
-	css_cs(css)->attach_in_progress--;
-	percpu_up_write(&cpuset_rwsem);
+	percpu_down_read(&cpuset_rwsem);
+	atomic_dec(&css_cs(css)->attach_in_progress);
+	percpu_up_read(&cpuset_rwsem);
 }
 
 /*
@@ -2562,8 +2566,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
-	cs->attach_in_progress--;
-	if (!cs->attach_in_progress)
+	atomic_dec(&cs->attach_in_progress);
+	if (!atomic_read(&cs->attach_in_progress))
 		wake_up(&cpuset_attach_wq);
 
 	percpu_up_write(&cpuset_rwsem);
@@ -3072,6 +3076,7 @@ cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
 	nodes_clear(cs->mems_allowed);
 	nodes_clear(cs->effective_mems);
 	fmeter_init(&cs->fmeter);
+	atomic_set(&cs->attach_in_progress, 0);
 	cs->relax_domain_level = -1;
 
 	/* Set CS_MEMORY_MIGRATE for default hierarchy */
@@ -3383,7 +3388,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	bool mems_updated;
 	struct cpuset *parent;
 retry:
-	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
+	wait_event(cpuset_attach_wq, atomic_read(&cs->attach_in_progress) == 0);
 
 	percpu_down_write(&cpuset_rwsem);
 
@@ -3391,7 +3396,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	 * We have raced with task attaching. We wait until attaching
 	 * is finished, so we won't attach a task to an empty cpuset.
 	 */
-	if (cs->attach_in_progress) {
+	if (atomic_read(&cs->attach_in_progress)) {
 		percpu_up_write(&cpuset_rwsem);
 		goto retry;
 	}
-- 
2.31.1
From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Waiman Long
Subject: [PATCH 2/2] cgroup/cpuset: Make percpu cpuset_rwsem operation depend on DYNMODS state
Date: Thu, 8 Dec 2022 14:56:34 -0500
Message-Id: <20221208195634.2604362-3-longman@redhat.com>
In-Reply-To: <20221208195634.2604362-1-longman@redhat.com>
References: <20221208195634.2604362-1-longman@redhat.com>

With commit 6a010a49b63a ("cgroup: Make !percpu threadgroup_rwsem
operations optional"), users can choose whether to optimize for moving
processes between cgroups frequently and efficiently, or for a more
static usage pattern where moving processes among cgroups is relatively
rare.
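For context, the opt-in added by commit 6a010a49b63a is exposed as the "favordynmods" mount option. A hypothetical shell session follows, assuming cgroup2 is mounted at /sys/fs/cgroup; the exact remount semantics may vary by kernel version:

```shell
# Favor frequent process migrations: disable the percpu rwsem reader
# fast path so write-side lockers skip the RCU grace-period wait.
mount -o remount,favordynmods /sys/fs/cgroup

# Remounting without the option reverts to the default, which favors
# the fork/exit hot paths over migration latency (behavior assumed).
mount -o remount /sys/fs/cgroup
```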
The percpu cpuset_rwsem is in the same boat as percpu_threadgroup_rwsem,
since moving processes among cpusets has the same latency impact
depending on whether the percpu operation in cpuset_rwsem is disabled
or not.

Ideally, cpuset_bind() would be the best place to check whether
cpuset_rwsem should have its reader fast path disabled like
percpu_threadgroup_rwsem, so that this gets re-evaluated every time the
cpuset is rebound. Unfortunately, cgroup_favor_dynmods(), which sets the
CGRP_ROOT_FAVOR_DYNMODS flag, is called after the bind() method call.
Instead, the newly added cpuset_check_dynmods() function is called at
the first cpuset_css_online() call after a cpuset_bind() call, when the
first child cpuset is created.

Signed-off-by: Waiman Long
---
 kernel/cgroup/cpuset.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 800c65de5daa..daf8ca948176 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -255,6 +255,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_FAVOR_DYNMODS,	/* top_cpuset only */
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -3049,6 +3050,27 @@ static struct cftype dfl_files[] = {
 	{ }	/* terminate */
 };
 
+static bool dynmods_checked __read_mostly;
+static void cpuset_check_dynmods(struct cgroup_root *root)
+{
+	bool favor_dynmods;
+
+	lockdep_assert_held(&cgroup_mutex);
+	percpu_rwsem_assert_held(&cpuset_rwsem);
+
+	/*
+	 * Check the CGRP_ROOT_FAVOR_DYNMODS of the cgroup root to find out
+	 * if we need to enable or disable reader fast path of cpuset_rwsem.
+	 */
+	favor_dynmods = test_bit(CS_FAVOR_DYNMODS, &top_cpuset.flags);
+	if (favor_dynmods && !(root->flags & CGRP_ROOT_FAVOR_DYNMODS)) {
+		rcu_sync_exit(&cpuset_rwsem.rss);
+		clear_bit(CS_FAVOR_DYNMODS, &top_cpuset.flags);
+	} else if (!favor_dynmods && (root->flags & CGRP_ROOT_FAVOR_DYNMODS)) {
+		rcu_sync_enter(&cpuset_rwsem.rss);
+		set_bit(CS_FAVOR_DYNMODS, &top_cpuset.flags);
+	}
+}
 
 /*
  * cpuset_css_alloc - allocate a cpuset css
@@ -3099,6 +3121,14 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	cpus_read_lock();
 	percpu_down_write(&cpuset_rwsem);
 
+	/*
+	 * Check dynmod state on the first css_online() call.
+	 */
+	if (unlikely(!dynmods_checked)) {
+		cpuset_check_dynmods(cpuset_cgrp_subsys.root);
+		dynmods_checked = true;
+	}
+
 	set_bit(CS_ONLINE, &cs->flags);
 	if (is_spread_page(parent))
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
@@ -3201,6 +3231,12 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
+	/*
+	 * Reset dynmods_checked to be evaluated again in the next
+	 * cpuset_css_online().
+	 */
+	dynmods_checked = false;
+
 	percpu_down_write(&cpuset_rwsem);
 	spin_lock_irq(&callback_lock);
 
-- 
2.31.1