From nobody Fri Nov 29 23:35:50 2024
From: Chen Ridong
Subject: [PATCH v4 1/3] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock
Date: Fri, 13 Sep 2024 13:17:18 +0000
Message-ID: <20240913131720.1762188-2-chenridong@huawei.com>
In-Reply-To: <20240913131720.1762188-1-chenridong@huawei.com>
References: <20240913131720.1762188-1-chenridong@huawei.com>

We found a hung_task problem as shown below:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30

This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. CPUs are set online and offline repeatedly.
3. watchdog_thresh is set repeatedly.
The scripts can be obtained at the Link given above the signature.

The root cause is that cgroup_mutex and cpu_hotplug_lock are acquired
in different tasks, which can lead to a deadlock through the following
steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The
   max_active of system_wq is WQ_DFL_ACTIVE (256). Consequently, all
   active works are cgroup_bpf_release works, and many more
   cgroup_bpf_release works are placed on the inactive queue. As
   illustrated in the diagram below, there are 256 works on the active
   queue plus n works on the inactive queue.
2. Setting watchdog_thresh holds cpu_hotplug_lock.read and puts an
   smp_call_on_cpu work ('sscs.work') into system_wq. However, step 1
   has already filled system_wq, so 'sscs.work' lands on the inactive
   queue and has to wait until the works queued before it (the n
   cgroup_bpf_release works) have executed; it is therefore blocked for
   a while.
3. Taking a CPU offline requires cpu_hotplug_lock.write, which is
   blocked by step 2.
4. The cpusets deleted in step 1 put cgroup_release works into
   cgroup_destroy_wq, which keep competing for cgroup_mutex. When
   cgroup_mutex is acquired by the work running css_killed_work_fn, it
   calls cpuset_css_offline, which needs to acquire
   cpu_hotplug_lock.read. However, cpuset_css_offline is blocked by
   step 3.
5. At this point, all 256 works on the active queue are
   cgroup_bpf_release works attempting to acquire cgroup_mutex, so all
   of them are blocked. Consequently, sscs.work can never execute.
   Ultimately, the four parties block one another, forming a deadlock.

system_wq(step1)                WatchDog(step2)                 cpu offline(step3)              cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
                                __lockup_detector_reconfigure
                                P(cpu_hotplug_lock.read)
                                put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work waits to be executed
                                waiting for sscs.work to finish
                                                                percpu_down_write
                                                                P(cpu_hotplug_lock.write)
                                                                ...blocking...
                                                                                                css_killed_work_fn
                                                                                                P(cgroup_mutex)
                                                                                                cpuset_css_offline
                                                                                                P(cpu_hotplug_lock.read)
                                                                                                ...blocking...
256 cgroup_bpf_release works
mutex_lock(&cgroup_mutex);
...blocking...

To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
which breaks the cycle and solves the problem. System wqs are for
miscellaneous work that shouldn't create a large number of concurrent
work items; if something is going to generate more than
WQ_DFL_ACTIVE (256) concurrent work items, it should use its own
dedicated workqueue, as the sketch below illustrates.
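[Editor's illustration] The saturation in step 1 can be reproduced in
isolation. Below is a minimal, untested kernel-module sketch (the module
and all symbol names are made up for illustration, not from this series):
it floods system_wq with more sleeping works than WQ_DFL_ACTIVE, so a
later "victim" work, standing in for sscs.work, is stranded on the
inactive queue.

/* Illustrative sketch only: saturate system_wq and strand a later work. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

#define NR_FLOOD 300	/* > WQ_DFL_ACTIVE (256 before patch 3/3) */

static struct work_struct flood_works[NR_FLOOD];
static struct work_struct victim_work;

static void flood_fn(struct work_struct *work)
{
	msleep(10000);	/* hold one of system_wq's max_active slots */
}

static void victim_fn(struct work_struct *work)
{
	pr_info("victim ran only after flood slots were released\n");
}

static int __init wq_flood_init(void)
{
	int i;

	for (i = 0; i < NR_FLOOD; i++) {
		INIT_WORK(&flood_works[i], flood_fn);
		queue_work(system_wq, &flood_works[i]);
	}
	/* Lands on the inactive queue behind the flood, like sscs.work. */
	INIT_WORK(&victim_work, victim_fn);
	queue_work(system_wq, &victim_work);
	return 0;
}
module_init(wq_flood_init);

MODULE_LICENSE("GPL");

If the flood works additionally contended on a mutex whose holder
indirectly waited for the victim, the cycle in steps 2-5 would close
into the reported deadlock.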
Reviewed-by: Michal Koutný
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Signed-off-by: Chen Ridong
---
 kernel/bpf/cgroup.c             | 2 +-
 kernel/cgroup/cgroup-internal.h | 1 +
 kernel/cgroup/cgroup.c          | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index e7113d700b87..a8804f62bc25 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
-	queue_work(system_wq, &cgrp->bpf.release_work);
+	queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
 }
 
 /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..17ac19bc8106 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -13,6 +13,7 @@
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
 extern void __init enable_debug_cgroup(void);
+extern struct workqueue_struct *cgroup_destroy_wq;
 
 /*
  * cgroup_path() takes a spin lock. It is good practice not to take
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2032dc501427..a94ea6b993be 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * destruction work items don't end up filling up max_active of system_wq
  * which may lead to deadlock.
  */
-static struct workqueue_struct *cgroup_destroy_wq;
+struct workqueue_struct *cgroup_destroy_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
-- 
2.34.1

From nobody Fri Nov 29 23:35:50 2024
From: Chen Ridong
Subject: [PATCH v4 2/3] workqueue: doc: Add a note that saturating the system_wq is not permitted
Date: Fri, 13 Sep 2024 13:17:19 +0000
Message-ID: <20240913131720.1762188-3-chenridong@huawei.com>
In-Reply-To: <20240913131720.1762188-1-chenridong@huawei.com>
References: <20240913131720.1762188-1-chenridong@huawei.com>

If something is expected to generate a large number of concurrent
works, it should utilize its own dedicated workqueue rather than a
system wq, because it may otherwise saturate system_wq and potentially
block other works, e.g., cgroup release works. Document this as a note.

Signed-off-by: Chen Ridong
---
 Documentation/core-api/workqueue.rst | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 16f861c9791e..338b25e86f8c 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -356,6 +356,10 @@ Guidelines
   special attribute, can use one of the system wq.  There is no difference in
   execution characteristics between using a dedicated wq and a system wq.
 
+  Note: If something is expected to generate a large number of concurrent
+  works, it should utilize its own dedicated workqueue rather than a
+  system wq, because this may saturate system_wq and potentially block
+  other works, e.g., cgroup release works.
 
 * Unless work items are expected to consume a huge amount of CPU cycles,
   using a bound wq is usually beneficial due to the increased
-- 
2.34.1

From nobody Fri Nov 29 23:35:50 2024
From: Chen Ridong
Subject: [PATCH v4 3/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
Date: Fri, 13 Sep 2024 13:17:20 +0000
Message-ID: <20240913131720.1762188-4-chenridong@huawei.com>
In-Reply-To: <20240913131720.1762188-1-chenridong@huawei.com>
References: <20240913131720.1762188-1-chenridong@huawei.com>

WQ_MAX_ACTIVE is currently set to 512, a limit established
approximately 15 years ago. With the significant increase in machine
sizes and capabilities since then, the default limit of 256 concurrent
work items (WQ_DFL_ACTIVE) is no longer sufficient. Therefore, increase
WQ_MAX_ACTIVE to 2048, which raises WQ_DFL_ACTIVE (WQ_MAX_ACTIVE / 2)
to 1024.
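[Editor's illustration] To show what these constants mean for callers
(a hedged sketch; the workqueue name and init function are made up, not
from this series): passing 0 for @max_active selects WQ_DFL_ACTIVE, and
explicit values are clamped to WQ_MAX_ACTIVE by alloc_workqueue().

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int __init example_wq_init(void)
{
	/*
	 * max_active == 0 selects WQ_DFL_ACTIVE (WQ_MAX_ACTIVE / 2),
	 * which this patch raises from 256 to 1024.
	 */
	example_wq = alloc_workqueue("example_wq", WQ_UNBOUND, 0);
	if (!example_wq)
		return -ENOMEM;

	/*
	 * An explicit request may now go up to the new WQ_MAX_ACTIVE
	 * (2048); anything larger is clamped down to that limit.
	 */
	return 0;
}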
Signed-off-by: Chen Ridong
---
 Documentation/core-api/workqueue.rst | 4 ++--
 include/linux/workqueue.h            | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 338b25e86f8c..b66b55d35c9c 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -245,8 +245,8 @@ CPU which can be assigned to the work items of a wq. For example, with
 at the same time per CPU. This is always a per-CPU attribute, even for
 unbound workqueues.
 
-The maximum limit for ``@max_active`` is 512 and the default value used
-when 0 is specified is 256. These values are chosen sufficiently high
+The maximum limit for ``@max_active`` is 2048 and the default value used
+when 0 is specified is 1024. These values are chosen sufficiently high
 such that they are not the limiting factor while providing protection in
 runaway cases.
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 59c2695e12e7..b0dc957c3e56 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -412,7 +412,7 @@ enum wq_flags {
 };
 
 enum wq_consts {
-	WQ_MAX_ACTIVE		= 512,	/* I like 512, better ideas? */
+	WQ_MAX_ACTIVE		= 2048,	/* I like 2048, better ideas? */
 	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
 
-- 
2.34.1
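[Editor's illustration] Tying the series together, here is a minimal
sketch of the dedicated-workqueue pattern that patch 2/3's note asks
for (the 'foo' subsystem and all names are hypothetical): a subsystem
that can generate bursts of release work creates its own workqueue
instead of sharing system_wq, mirroring the change in patch 1/3.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

/* Dedicated queue: bursts of release works cannot saturate system_wq. */
static struct workqueue_struct *foo_release_wq;

struct foo_object {
	struct work_struct release_work;
	/* ...per-object state... */
};

static void foo_release_fn(struct work_struct *work)
{
	struct foo_object *obj =
		container_of(work, struct foo_object, release_work);

	kfree(obj);	/* free per-object resources */
}

/* Called from the (possibly atomic) release path of each object. */
static void __maybe_unused foo_object_release(struct foo_object *obj)
{
	INIT_WORK(&obj->release_work, foo_release_fn);
	queue_work(foo_release_wq, &obj->release_work);	/* not system_wq */
}

static int __init foo_init(void)
{
	foo_release_wq = alloc_workqueue("foo_release", 0, 0);
	return foo_release_wq ? 0 : -ENOMEM;
}
module_init(foo_init);

MODULE_LICENSE("GPL");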