From nobody Mon Feb  9 12:28:47 2026
Received: from dggsgout11.his.huawei.com (dggsgout11.his.huawei.com
 [45.249.212.51])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 15CB339FD9;
	Tue, 30 Dec 2025 07:17:55 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=45.249.212.51
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1767079079; cv=none;
 b=qZ2FF/KZNySfRpnStBvKKWPKmwQEqSXaWNbgVir0wNcQzh8rttk9DX9CIMg4qeTbzvbgCDomMo34ukuliKVDBtphmlmufebOFwMSoZfAWvTV/1HxRu5yWKSQ+GKZ6xnkBu+uCaIu2xa9QHEBliTnq7D1snoxFK30DvFlMQxAacQ=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1767079079; c=relaxed/simple;
	bh=E1Rz9lMhsCZ9uEElurviw+JiOh0iLb9Yiq8G5JNmj5w=;
	h=From:To:Cc:Subject:Date:Message-Id:MIME-Version:Content-Type;
 b=VIRCnay3Hn9Q+QXG3bEnjAMUsQRWASRUsaITIQtNOJEjeWor5VQ2ZfGp4NAMLQ9T/QBDZJ0zc+b1dOpKesAYVPNtQCkPVDEG+j2c3Nq/kLwy892z3s6pj2Ecn/27srkpEZxFS3fcrVJxabH9TwoMo72u96fdmDnYXx+AG7nLZQg=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=huaweicloud.com;
 spf=pass smtp.mailfrom=huaweicloud.com; arc=none smtp.client-ip=45.249.212.51
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=huaweicloud.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=huaweicloud.com
Received: from mail.maildlp.com (unknown [172.19.163.170])
	by dggsgout11.his.huawei.com (SkyGuard) with ESMTPS id 4dgPXq45WlzYQtxx;
	Tue, 30 Dec 2025 15:16:59 +0800 (CST)
Received: from mail02.huawei.com (unknown [10.116.40.128])
	by mail.maildlp.com (Postfix) with ESMTP id E72394056C;
	Tue, 30 Dec 2025 15:17:47 +0800 (CST)
Received: from huaweicloud.com (unknown [10.50.85.136])
	by APP4 (Coremail) with SMTP id gCh0CgAXd_eYfFNpfl_6Bw--.33850S4;
	Tue, 30 Dec 2025 15:17:45 +0800 (CST)
From: Wang Zhaolong <wangzhaolong@huaweicloud.com>
To: trondmy@kernel.org,
	anna@kernel.org,
	kolga@netapp.com
Cc: linux-nfs@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	yi.zhang@huawei.com,
	yangerkun@huawei.com,
	chengzhihao1@huawei.com,
	lilingfeng3@huawei.com,
	zhangjian496@huawei.com,
	wangzhaolong@huaweicloud.com
Subject: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can
 deadlock state manager creation
Date: Tue, 30 Dec 2025 15:17:44 +0800
Message-Id: <20251230071744.9762-1-wangzhaolong@huaweicloud.com>
X-Mailer: git-send-email 2.34.3
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-CM-TRANSID: gCh0CgAXd_eYfFNpfl_6Bw--.33850S4
X-Coremail-Antispam: 1UD129KBjvJXoW3JF13KF45AFWxZrWkJF1fZwb_yoWftr4rpF
	WUGr98KrWkJr18Wrn7ZF48Z3WYy397Gr47JFyxG34ay3Z8J3ZxKFy2y3WYvFy5GrW8Jan2
	qF1vyFW0va15AFJanT9S1TB71UUUUUJqnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2
	9KBjDU0xBIdaVrnRJUUU9E14x267AKxVW8JVW5JwAFc2x0x2IEx4CE42xK8VAvwI8IcIk0
	rVWrJVCq3wAFIxvE14AKwVWUJVWUGwA2ocxC64kIII0Yj41l84x0c7CEw4AK67xGY2AK02
	1l84ACjcxK6xIIjxv20xvE14v26w1j6s0DM28EF7xvwVC0I7IYx2IY6xkF7I0E14v26r4U
	JVWxJr1l84ACjcxK6I8E87Iv67AKxVW0oVCq3wA2z4x0Y4vEx4A2jsIEc7CjxVAFwI0_Gc
	CE3s1ln4kS14v26r1Y6r17M2AIxVAIcxkEcVAq07x20xvEncxIr21l5I8CrVACY4xI64kE
	6c02F40Ex7xfMcIj6xIIjxv20xvE14v26r106r15McIj6I8E87Iv67AKxVWUJVW8JwAm72
	CE4IkC6x0Yz7v_Jr0_Gr1lF7xvr2IYc2Ij64vIr41lF7I21c0EjII2zVCS5cI20VAGYxC7
	M4IIrI8v6xkF7I0E8cxan2IY04v7MxkF7I0En4kS14v26r4a6rW5MxAIw28IcxkI7VAKI4
	8JMxC20s026xCaFVCjc4AY6r1j6r4UMI8I3I0E5I8CrVAFwI0_Jr0_Jr4lx2IqxVCjr7xv
	wVAFwI0_JrI_JrWlx4CE17CEb7AF67AKxVWUtVW8ZwCIc40Y0x0EwIxGrwCI42IY6xIIjx
	v20xvE14v26r1j6r1xMIIF0xvE2Ix0cI8IcVCY1x0267AKxVW8JVWxJwCI42IY6xAIw20E
	Y4v20xvaj40_Jr0_JF4lIxAIcVC2z280aVAFwI0_Jr0_Gr1lIxAIcVC2z280aVCY1x0267
	AKxVW8JVW8JrUvcSsGvfC2KfnxnUUI43ZEXa7sRRoq2tUUUUU==
X-CM-SenderInfo: pzdqw6xkdrz0tqj6x35dzhxuhorxvhhfrp/

Hi all,

I=E2=80=99d like to start an RFC discussion about a hung-task/deadlock that=
 we hit in
production-like testing on NFSv4.1 clients under server outage + memory
pressure. The system becomes stuck even after the server/network is restore=
d.

The scenario is:

- NFSv4.1 client running heavy multi-threaded buffered I/O (fio-style workl=
oad)
- server outage (restart/power-off) and/or network blackhole
- client under significant memory pressure / reclaim activity (observed in =
the
  traces below)

The observed behavior is a deadlock cycle involving:

- v4.1 session slot table =E2=80=9Cdraining=E2=80=9D (NFS4_SLOT_TBL_DRAININ=
G)
- state manager thread creation via kthread_run()
- kthreadd entering direct reclaim and getting stuck in NFS commit/writebac=
k paths
- non-privileged RPC tasks sleeping on slot table waitq due to draining

Below is the call-chain I reconstructed from traces (three key participants=
):

P1: sunrpc worker 1 (error handling triggers session recovery and tries to =
startstate manager)
rpc_exit_task
  nfs_writeback_done
    nfs4_write_done
      nfs4_sequence_done
        nfs41_sequence_process
          // status error, goto session_recover
          set_bit(NFS4_SLOT_TBL_DRAINING, &session->fc_slot_table.slot_tbl_=
state)  <1>
          nfs4_schedule_session_recovery
            nfs4_schedule_state_manager
              kthread_run  // - Create a state manager thread to release th=
e draining slots
                kthread_create_on_node
                  __kthread_create_on_node
                    wait_for_completion(&done);   <2> wait for <3>

P2: kthreadd (thread creation triggers reclaim; reclaim hits NFS folios and=
 blocks in commit wait)
kthreadd
 kernel_thread
  kernel_clone
   copy_process
    dup_task_struct
     alloc_thread_stack_node
      __vmalloc_node_range
       __vmalloc_area_node
        vm_area_alloc_pages
         alloc_pages_bulk_array_mempolicy
          __alloc_pages_bulk
           __alloc_pages
            __perform_reclaim
             try_to_free_pages
              do_try_to_free_pages
               shrink_zones
                shrink_node
                 shrink_node_memcgs
                  shrink_lruvec
                   shrink_inactive_list
                    shrink_folio_list
                     filemap_release_folio
                      nfs_release_folio
                       nfs_wb_folio
                        folio PG_private !PG_writeback !PG_dirty
                         nfs_commit_inode(inode, FLUSH_SYNC);
                          __nfs_commit_inode
                           nfs_generic_commit_list
                            nfs_commit_list
                             nfs_initiate_commit
                              rpc_run_task  // Async task
                           wait_on_commit            <3> wait for <4>

P3: sunrpc worker 2 (non-privileged tasks are blocked by draining)
__rpc_execute
  nfs4_setup_sequence
    // if (nfs4_slot_tbl_draining(tbl) && !args->sa_privileged) goto sleep
    rpc_sleep_on(&tbl->slot_tbl_waitq, task, NULL);    <4>  blocked by <1>

This forms a deadlock:

- <1> enables draining; non-privileged requests then block at <4>
- recovery path attempts to create the state manager thread, but
  blocks at <2> waiting for kthreadd
- kthreadd is blocked at <3> waiting for commit progress / completion,
  but commit/RPC progress is impeded because requests are stuck behind drai=
ning at <4>
- once in this state, restoring the server/network does not resolve the dea=
dlock

I suspect this deadlock became possible after the following mainline change=
 that
freezes the session table immediately on NFS4ERR_BADSESSION (and similar er=
ror paths):

c907e72f58ed ("NFSv4.1: freeze the session table upon receiving NFS4ERR_BAD=
SESSION")

It sets NFS4_SLOT_TBL_DRAINING before the recovery thread runs:

Questions:

1. Has anyone else observed a similar deadlock involving slot table drainin=
g + memory
   reclaim? It looks like a similar issue might have been reported before =
=E2=80=94 see
   SUSE Bugzilla #1211527. [1]
2. Is it intended that kthreadd (or other critical kernel threads) may bloc=
k in
   nfs_commit_inode(FLUSH_SYNC) as part of reclaim?
3. Is there an established way to ensure recovery threads can always be cre=
ated even
   under severe memory pressure (e.g., reserve resources, GFP flags, or mov=
ing state
   manager creation out of contexts that can trigger reclaim)?

I wrote a local patch purely as a discussion starter. I realize this approa=
ch is likely
not the right solution upstream; I=E2=80=99m sharing it only to help reason=
 about where the
cycle can be broken. I can post the patch if people think it=E2=80=99s usef=
ul for the discussion.

Link: https://access.redhat.com/solutions/7016754 [1]
Fixes: c907e72f58ed ("NFSv4.1: freeze the session table upon receiving NFS4=
ERR_BADSESSION")
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Reported-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
---
 fs/nfs/file.c          |  6 +++---
 fs/nfs/write.c         | 10 +++++-----
 include/linux/nfs_fs.h |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d020aab40c64..e556a16ce95b 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -487,11 +487,11 @@ static void nfs_invalidate_folio(struct folio *folio,=
 size_t offset,
 	dfprintk(PAGECACHE, "NFS: invalidate_folio(%lu, %zu, %zu)\n",
 		 folio->index, offset, length);
=20
 	/* Cancel any unstarted writes on this page */
 	if (offset !=3D 0 || length < folio_size(folio))
-		nfs_wb_folio(inode, folio);
+		nfs_wb_folio(inode, folio, true);
 	else
 		nfs_wb_folio_cancel(inode, folio);
 	folio_wait_private_2(folio); /* [DEPRECATED] */
 	trace_nfs_invalidate_folio(inode, folio_pos(folio) + offset, length);
 }
@@ -509,11 +509,11 @@ static bool nfs_release_folio(struct folio *folio, gf=
p_t gfp)
 	/* If the private flag is set, then the folio is not freeable */
 	if (folio_test_private(folio)) {
 		if ((current_gfp_context(gfp) & GFP_KERNEL) !=3D GFP_KERNEL ||
 		    current_is_kswapd() || current_is_kcompactd())
 			return false;
-		if (nfs_wb_folio(folio->mapping->host, folio) < 0)
+		if (nfs_wb_folio(folio->mapping->host, folio, false) < 0)
 			return false;
 	}
 	return nfs_fscache_release_folio(folio, gfp);
 }
=20
@@ -558,11 +558,11 @@ static int nfs_launder_folio(struct folio *folio)
=20
 	dfprintk(PAGECACHE, "NFS: launder_folio(%ld, %llu)\n",
 		inode->i_ino, folio_pos(folio));
=20
 	folio_wait_private_2(folio); /* [DEPRECATED] */
-	ret =3D nfs_wb_folio(inode, folio);
+	ret =3D nfs_wb_folio(inode, folio, true);
 	trace_nfs_launder_folio_done(inode, folio_pos(folio),
 			folio_size(folio), ret);
 	return ret;
 }
=20
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 336c510f3750..bc541a192197 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1059,11 +1059,11 @@ static struct nfs_page *nfs_try_to_update_request(s=
truct folio *folio,
 	 * nfs_lock_and_join_requests() cannot preserve
 	 * commit flags, so we have to replay the write.
 	 */
 	nfs_mark_request_dirty(req);
 	nfs_unlock_and_release_request(req);
-	error =3D nfs_wb_folio(folio->mapping->host, folio);
+	error =3D nfs_wb_folio(folio->mapping->host, folio, true);
 	trace_nfs_try_to_update_request_done(folio_inode(folio), offset, bytes, e=
rror);
 	return (error < 0) ? ERR_PTR(error) : NULL;
 }
=20
 /*
@@ -1137,11 +1137,11 @@ int nfs_flush_incompatible(struct file *file, struc=
t folio *folio)
 			do_flush |=3D l_ctx->lockowner !=3D current->files;
 		}
 		nfs_release_request(req);
 		if (!do_flush)
 			return 0;
-		status =3D nfs_wb_folio(folio->mapping->host, folio);
+		status =3D nfs_wb_folio(folio->mapping->host, folio, true);
 	} while (status =3D=3D 0);
 	return status;
 }
=20
 /*
@@ -2030,11 +2030,11 @@ int nfs_wb_folio_cancel(struct inode *inode, struct=
 folio *folio)
  * @folio: pointer to folio
  *
  * Assumes that the folio has been locked by the caller, and will
  * not unlock it.
  */
-int nfs_wb_folio(struct inode *inode, struct folio *folio)
+int nfs_wb_folio(struct inode *inode, struct folio *folio, bool sync)
 {
 	loff_t range_start =3D folio_pos(folio);
 	size_t len =3D folio_size(folio);
 	struct writeback_control wbc =3D {
 		.sync_mode =3D WB_SYNC_ALL,
@@ -2055,11 +2055,11 @@ int nfs_wb_folio(struct inode *inode, struct folio =
*folio)
 			continue;
 		}
 		ret =3D 0;
 		if (!folio_test_private(folio))
 			break;
-		ret =3D nfs_commit_inode(inode, FLUSH_SYNC);
+		ret =3D nfs_commit_inode(inode, sync ? FLUSH_SYNC: 0);
 		if (ret < 0)
 			goto out_error;
 	}
 out_error:
 	trace_nfs_writeback_folio_done(inode, range_start, len, ret);
@@ -2078,11 +2078,11 @@ int nfs_migrate_folio(struct address_space *mapping=
, struct folio *dst,
 	 *        that we can safely release the inode reference while holding
 	 *        the folio lock.
 	 */
 	if (folio_test_private(src)) {
 		if (mode =3D=3D MIGRATE_SYNC)
-			nfs_wb_folio(src->mapping->host, src);
+			nfs_wb_folio(src->mapping->host, src, true);
 		if (folio_test_private(src))
 			return -EBUSY;
 	}
=20
 	if (folio_test_private_2(src)) { /* [DEPRECATED] */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a6624edb7226..295bc6214750 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -634,11 +634,11 @@ extern int  nfs_update_folio(struct file *file, struc=
t folio *folio,
  * Try to write back everything synchronously (but check the
  * return value!)
  */
 extern int nfs_sync_inode(struct inode *inode);
 extern int nfs_wb_all(struct inode *inode);
-extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
+extern int nfs_wb_folio(struct inode *inode, struct folio *folio, bool syn=
c);
 int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
 extern int  nfs_commit_inode(struct inode *, int);
 extern struct nfs_commit_data *nfs_commitdata_alloc(void);
 extern void nfs_commit_free(struct nfs_commit_data *data);
 void nfs_commit_begin(struct nfs_mds_commit_info *cinfo);
--=20
2.34.3