From nobody Tue May 5 10:02:26 2026
From: Michal Koutný
To: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk
Subject: [PATCH 1/2] cgroup: Wait for cgroup_subsys_state offlining on unmount
Date: Wed, 25 May 2022 17:15:16 +0200
Message-Id: <20220525151517.8430-2-mkoutny@suse.com>
X-Mailer: git-send-email 2.35.3
In-Reply-To: <20220525151517.8430-1-mkoutny@suse.com>
References: <20220525151517.8430-1-mkoutny@suse.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
X-Mailing-List: linux-kernel@vger.kernel.org

The reported problem occurs when a cgroup hierarchy is unmounted shortly
after its last cgroup is removed. The last cgroup prevents the root
cgroup's css->refcnt from being killed, so the respective cgroup root
remains in existence permanently.

This is actually intended behavior for the memory controller, whose state
is long-lived and for which there is no better option to attach it later
(see also commit 3c606d35fe97 ("cgroup: prevent mount hang due to memory
controller lifetime")).

We can improve the situation by checking the children list only after any
cgroups in the middle of removal are gone, which we ensure by flushing
cgroup_destroy_wq.
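To illustrate the ordering, here is a minimal user-space sketch, not kernel
code. It assumes a toy model in which a removed child stays linked under the
root until its queued destroy work has run. The names toy_root, toy_rmdir(),
toy_flush() and toy_can_kill_root() are hypothetical and exist only for this
illustration; toy_flush() merely stands in for
flush_workqueue(cgroup_destroy_wq).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model; every name below is hypothetical, not a kernel API. */
struct toy_root {
	int linked;	/* children still linked on the root's children list */
	int queued;	/* removals queued on the destroy workqueue, not yet run */
	int pinned;	/* removed children kept alive by long-lived state */
};

/* Removing a child never unlinks it directly; it stays linked for now. */
static void toy_rmdir(struct toy_root *r, bool pinned_by_state)
{
	assert(r->linked > r->queued + r->pinned);	/* a live child must exist */
	if (pinned_by_state)
		r->pinned++;	/* e.g. memory controller state: stays linked */
	else
		r->queued++;	/* unlinked once the queued work has run */
}

/* Stands in for flush_workqueue(cgroup_destroy_wq): run all queued removals. */
static void toy_flush(struct toy_root *r)
{
	r->linked -= r->queued;
	r->queued = 0;
}

/* Mirrors only the list_empty() part of the check in cgroup_kill_sb(). */
static bool toy_can_kill_root(struct toy_root *r, bool flush_first)
{
	if (flush_first)
		toy_flush(r);
	return r->linked == 0;
}
```

With one child removed but its work still queued, the check fails without the
flush and succeeds with it. A child pinned by long-lived state keeps the root
alive in both cases, which matches the memory controller behavior described
above.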
Reported-by: Bui Quang Minh
Link: https://lore.kernel.org/r/20220404142535.145975-1-minhquangbui99@gmail.com
Signed-off-by: Michal Koutný
---
 kernel/cgroup/cgroup.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..a5b0d5d54fbc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2205,11 +2205,14 @@ static void cgroup_kill_sb(struct super_block *sb)
 	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
 
 	/*
-	 * If @root doesn't have any children, start killing it.
+	 * If @root doesn't have any children held by residual state (e.g.
+	 * memory controller), start killing it, flush workqueue to filter out
+	 * transiently offlined children.
 	 * This prevents new mounts by disabling percpu_ref_tryget_live().
 	 *
 	 * And don't kill the default root.
 	 */
+	flush_workqueue(cgroup_destroy_wq);
 	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
 	    !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
 		cgroup_bpf_offline(&root->cgrp);
-- 
2.35.3

From nobody Tue May 5 10:02:26 2026
From: Michal Koutný
To: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Bui Quang Minh, Tadeusz Struk
Subject: [PATCH 2/2] cgroup: Use separate work structs on css release path
Date: Wed, 25 May 2022 17:15:17 +0200
Message-Id: <20220525151517.8430-3-mkoutny@suse.com>
X-Mailer: git-send-email 2.35.3
In-Reply-To: <20220525151517.8430-1-mkoutny@suse.com>
References: <20220525151517.8430-1-mkoutny@suse.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
X-Mailing-List: linux-kernel@vger.kernel.org

The cgroup_subsys_states of cgroup subsystems (not cgroup->self) use both
kill and release callbacks on their release path (see the comment for
css_free_rwork_fn()). When the last reference is also the base reference,
we run into issues when the active work_struct (1) is re-initialized from
css_release() (2).

    // ref=1: only base reference
    kill_css()
      css_get() // fuse, ref+=1 == 2
      percpu_ref_kill_and_confirm
        // ref -= 1 == 1: kill base references
      [via rcu]
      css_killed_ref_fn == refcnt.confirm_switch
        queue_work(css->destroy_work)        (1)
        [via css->destroy_work]
        css_killed_work_fn == wq.func
          offline_css() // needs fuse
          css_put // ref -= 1 == 0: de-fuse, was last
            ...
            percpu_ref_put_many
              css_release
                queue_work(css->destroy_work) (2)
                [via css->destroy_work]
                css_release_work_fn == wq.func

Although we hold a fuse reference across css_killed_work_fn(), it pins the
css only until offline_css() has finished.

We could check inside css_release() whether destroy_work is active
(WORK_STRUCT_PENDING_BIT) and daisy-chain css_release_work_fn() from
css_release(). Instead, to avoid clashes with the various stages of work
item processing, we just spend some space in the css (my config's css
grows to 232B + 32B) and create a separate work entry for each user.
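The clash in the trace above can be sketched in user space. This is a toy
model under stated assumptions, not the kernel workqueue: toy_work, toy_css,
the in_flight flag and all toy_* functions are hypothetical names, and
"in flight" is the sketch's own simplification of "queued or still
executing". It only counts how often an item is re-initialized while its
handler has not yet returned.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_css;
typedef void (*toy_fn)(struct toy_css *);

/* Toy stand-in for work_struct: a handler plus an "in flight" marker. */
struct toy_work {
	toy_fn fn;
	bool in_flight;	/* queued or still executing (sketch's simplification) */
};

struct toy_css {
	struct toy_work killed_ref_work;
	struct toy_work release_work;
	struct toy_work *release_slot;	/* item used by the release path */
	int refs;
	int clashes;	/* re-inits of an item that was still in flight */
};

/* Models INIT_WORK() + queue_work(); counts re-init of an in-flight item. */
static void toy_init_and_queue(struct toy_css *css, struct toy_work *w,
			       toy_fn fn)
{
	if (w->in_flight)
		css->clashes++;
	w->fn = fn;
	w->in_flight = true;
}

static void toy_release_work_fn(struct toy_css *css)
{
	(void)css;	/* freeing is out of scope for this sketch */
}

/* Models css_put(): dropping the last reference queues the release work. */
static void toy_put(struct toy_css *css)
{
	assert(css->refs > 0);
	if (--css->refs == 0)
		toy_init_and_queue(css, css->release_slot, toy_release_work_fn);
}

/* Models css_killed_work_fn(): offline, then drop the fuse reference. */
static void toy_killed_work_fn(struct toy_css *css)
{
	toy_put(css);	/* the fuse was the last reference */
}

/* Runs the kill path once and returns the number of observed clashes. */
static int toy_teardown(bool separate_items)
{
	struct toy_css css = { .refs = 1 };	/* only the fuse is left */

	css.release_slot = separate_items ? &css.release_work
					  : &css.killed_ref_work;
	toy_init_and_queue(&css, &css.killed_ref_work, toy_killed_work_fn);
	css.killed_ref_work.fn(&css);	/* item stays in flight while running */
	css.killed_ref_work.in_flight = false;
	return css.clashes;
}
```

Calling toy_teardown(false) models the single shared destroy_work and reports
one clash; toy_teardown(true) models the two separate work entries and
reports none.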
Reported-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Reported-by: Tadeusz Struk
Link: https://lore.kernel.org/r/20220412192459.227740-1-tadeusz.struk@linaro.org/
Signed-off-by: Tadeusz Struk
Signed-off-by: Michal Koutný
---
 include/linux/cgroup-defs.h |  5 +++--
 kernel/cgroup/cgroup.c      | 14 +++++++-------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..16b99aa04305 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -178,8 +178,9 @@ struct cgroup_subsys_state {
 	 */
 	atomic_t online_cnt;
 
-	/* percpu_ref killing and RCU release */
-	struct work_struct destroy_work;
+	/* percpu_ref killing, css release, and RCU release work structs */
+	struct work_struct killed_ref_work;
+	struct work_struct release_work;
 	struct rcu_work destroy_rwork;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a5b0d5d54fbc..33b3a44391d7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5102,7 +5102,7 @@ static struct cftype cgroup_base_files[] = {
  *    css_free_work_fn().
  *
  * It is actually hairier because both step 2 and 4 require process context
- * and thus involve punting to css->destroy_work adding two additional
+ * and thus involve punting to css->release_work adding two additional
  * steps to the already complex sequence.
  */
 static void css_free_rwork_fn(struct work_struct *work)
@@ -5157,7 +5157,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, release_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -5213,8 +5213,8 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	INIT_WORK(&css->release_work, css_release_work_fn);
+	queue_work(cgroup_destroy_wq, &css->release_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5549,7 +5549,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 static void css_killed_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, killed_ref_work);
 
 	mutex_lock(&cgroup_mutex);
 
@@ -5570,8 +5570,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		INIT_WORK(&css->killed_ref_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->killed_ref_work);
 	}
 }
 
-- 
2.35.3
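
Each handler in the diff above recovers its css from a different embedded
member, so the member name passed to container_of() has to match the work
item that was queued. A minimal user-space sketch of that pattern follows.
toy_container_of() is a simplified stand-in for the kernel macro (it assumes
plain offsetof() arithmetic and does no type checking), and toy_work, toy_css
and the two helpers are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_work {
	int unused;	/* placeholder payload */
};

struct toy_css {
	int id;
	struct toy_work killed_ref_work;	/* used by the kill path */
	struct toy_work release_work;		/* used by the release path */
};

/* What a kill-path handler would do with its work pointer. */
static struct toy_css *toy_css_from_killed(struct toy_work *w)
{
	return toy_container_of(w, struct toy_css, killed_ref_work);
}

/* What a release-path handler would do with its work pointer. */
static struct toy_css *toy_css_from_release(struct toy_work *w)
{
	return toy_container_of(w, struct toy_css, release_work);
}
```

Passing &css.killed_ref_work to toy_css_from_killed() and &css.release_work
to toy_css_from_release() both yield &css. Mixing them up would subtract the
wrong offset, which is why every container_of() call is updated together with
the INIT_WORK()/queue_work() pair it belongs to.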