[PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can deadlock state manager creation

Wang Zhaolong posted 1 patch 1 month, 1 week ago
Hi all,

I’d like to start an RFC discussion about a hung-task/deadlock that we hit in
production-like testing on NFSv4.1 clients under server outage + memory
pressure. The system becomes stuck even after the server/network is restored.

The scenario is:

- NFSv4.1 client running heavy multi-threaded buffered I/O (fio-style workload)
- server outage (restart/power-off) and/or network blackhole
- client under significant memory pressure / reclaim activity (observed in the
  traces below)

The observed behavior is a deadlock cycle involving:

- v4.1 session slot table “draining” (NFS4_SLOT_TBL_DRAINING)
- state manager thread creation via kthread_run()
- kthreadd entering direct reclaim and getting stuck in NFS commit/writeback paths
- non-privileged RPC tasks sleeping on slot table waitq due to draining

Below is the call-chain I reconstructed from traces (three key participants):

P1: sunrpc worker 1 (error handling triggers session recovery and tries to start the state manager)
rpc_exit_task
  nfs_writeback_done
    nfs4_write_done
      nfs4_sequence_done
        nfs41_sequence_process
          // status error, goto session_recover
          set_bit(NFS4_SLOT_TBL_DRAINING, &session->fc_slot_table.slot_tbl_state)  <1>
          nfs4_schedule_session_recovery
            nfs4_schedule_state_manager
              kthread_run  // - Create a state manager thread to release the draining slots
                kthread_create_on_node
                  __kthread_create_on_node
                    wait_for_completion(&done);   <2> wait for <3>

P2: kthreadd (thread creation triggers reclaim; reclaim hits NFS folios and blocks in commit wait)
kthreadd
 kernel_thread
  kernel_clone
   copy_process
    dup_task_struct
     alloc_thread_stack_node
      __vmalloc_node_range
       __vmalloc_area_node
        vm_area_alloc_pages
         alloc_pages_bulk_array_mempolicy
          __alloc_pages_bulk
           __alloc_pages
            __perform_reclaim
             try_to_free_pages
              do_try_to_free_pages
               shrink_zones
                shrink_node
                 shrink_node_memcgs
                  shrink_lruvec
                   shrink_inactive_list
                    shrink_folio_list
                     filemap_release_folio
                      nfs_release_folio
                       nfs_wb_folio
                        folio PG_private !PG_writeback !PG_dirty
                         nfs_commit_inode(inode, FLUSH_SYNC);
                          __nfs_commit_inode
                           nfs_generic_commit_list
                            nfs_commit_list
                             nfs_initiate_commit
                              rpc_run_task  // Async task
                           wait_on_commit            <3> wait for <4>

P3: sunrpc worker 2 (non-privileged tasks are blocked by draining)
__rpc_execute
  nfs4_setup_sequence
    // if (nfs4_slot_tbl_draining(tbl) && !args->sa_privileged) goto sleep
    rpc_sleep_on(&tbl->slot_tbl_waitq, task, NULL);    <4>  blocked by <1>

This forms a deadlock:

- <1> enables draining; non-privileged requests then block at <4>
- recovery path attempts to create the state manager thread, but
  blocks at <2> waiting for kthreadd
- kthreadd is blocked at <3> waiting for commit progress / completion,
  but commit/RPC progress is impeded because requests are stuck behind draining at <4>
- once in this state, restoring the server/network does not resolve the deadlock

I suspect this deadlock became possible after the following mainline change that
freezes the session table immediately on NFS4ERR_BADSESSION (and similar error paths):

c907e72f58ed ("NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION")

It sets NFS4_SLOT_TBL_DRAINING before the recovery thread runs.

Questions:

1. Has anyone else observed a similar deadlock involving slot table draining + memory
   reclaim? It looks like a similar issue might have been reported before; see
   SUSE Bugzilla #1211527. [1]
2. Is it intended that kthreadd (or other critical kernel threads) may block in
   nfs_commit_inode(FLUSH_SYNC) as part of reclaim?
3. Is there an established way to ensure recovery threads can always be created even
   under severe memory pressure (e.g., reserve resources, GFP flags, or moving state
   manager creation out of contexts that can trigger reclaim)?

I wrote a local patch purely as a discussion starter. I realize this approach is likely
not the right solution upstream; I’m sharing it only to help reason about where the
cycle can be broken. I can post the patch if people think it’s useful for the discussion.

Link: https://access.redhat.com/solutions/7016754 [1]
Fixes: c907e72f58ed ("NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION")
Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
---
 fs/nfs/file.c          |  6 +++---
 fs/nfs/write.c         | 10 +++++-----
 include/linux/nfs_fs.h |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d020aab40c64..e556a16ce95b 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -487,11 +487,11 @@ static void nfs_invalidate_folio(struct folio *folio, size_t offset,
 	dfprintk(PAGECACHE, "NFS: invalidate_folio(%lu, %zu, %zu)\n",
 		 folio->index, offset, length);
 
 	/* Cancel any unstarted writes on this page */
 	if (offset != 0 || length < folio_size(folio))
-		nfs_wb_folio(inode, folio);
+		nfs_wb_folio(inode, folio, true);
 	else
 		nfs_wb_folio_cancel(inode, folio);
 	folio_wait_private_2(folio); /* [DEPRECATED] */
 	trace_nfs_invalidate_folio(inode, folio_pos(folio) + offset, length);
 }
@@ -509,11 +509,11 @@ static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
 	/* If the private flag is set, then the folio is not freeable */
 	if (folio_test_private(folio)) {
 		if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
 		    current_is_kswapd() || current_is_kcompactd())
 			return false;
-		if (nfs_wb_folio(folio->mapping->host, folio) < 0)
+		if (nfs_wb_folio(folio->mapping->host, folio, false) < 0)
 			return false;
 	}
 	return nfs_fscache_release_folio(folio, gfp);
 }
 
@@ -558,11 +558,11 @@ static int nfs_launder_folio(struct folio *folio)
 
 	dfprintk(PAGECACHE, "NFS: launder_folio(%ld, %llu)\n",
 		inode->i_ino, folio_pos(folio));
 
 	folio_wait_private_2(folio); /* [DEPRECATED] */
-	ret = nfs_wb_folio(inode, folio);
+	ret = nfs_wb_folio(inode, folio, true);
 	trace_nfs_launder_folio_done(inode, folio_pos(folio),
 			folio_size(folio), ret);
 	return ret;
 }
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 336c510f3750..bc541a192197 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1059,11 +1059,11 @@ static struct nfs_page *nfs_try_to_update_request(struct folio *folio,
 	 * nfs_lock_and_join_requests() cannot preserve
 	 * commit flags, so we have to replay the write.
 	 */
 	nfs_mark_request_dirty(req);
 	nfs_unlock_and_release_request(req);
-	error = nfs_wb_folio(folio->mapping->host, folio);
+	error = nfs_wb_folio(folio->mapping->host, folio, true);
 	trace_nfs_try_to_update_request_done(folio_inode(folio), offset, bytes, error);
 	return (error < 0) ? ERR_PTR(error) : NULL;
 }
 
 /*
@@ -1137,11 +1137,11 @@ int nfs_flush_incompatible(struct file *file, struct folio *folio)
 			do_flush |= l_ctx->lockowner != current->files;
 		}
 		nfs_release_request(req);
 		if (!do_flush)
 			return 0;
-		status = nfs_wb_folio(folio->mapping->host, folio);
+		status = nfs_wb_folio(folio->mapping->host, folio, true);
 	} while (status == 0);
 	return status;
 }
 
 /*
@@ -2030,11 +2030,11 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
  * @folio: pointer to folio
  *
  * Assumes that the folio has been locked by the caller, and will
  * not unlock it.
  */
-int nfs_wb_folio(struct inode *inode, struct folio *folio)
+int nfs_wb_folio(struct inode *inode, struct folio *folio, bool sync)
 {
 	loff_t range_start = folio_pos(folio);
 	size_t len = folio_size(folio);
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_ALL,
@@ -2055,11 +2055,11 @@ int nfs_wb_folio(struct inode *inode, struct folio *folio)
 			continue;
 		}
 		ret = 0;
 		if (!folio_test_private(folio))
 			break;
-		ret = nfs_commit_inode(inode, FLUSH_SYNC);
+		ret = nfs_commit_inode(inode, sync ? FLUSH_SYNC : 0);
 		if (ret < 0)
 			goto out_error;
 	}
 out_error:
 	trace_nfs_writeback_folio_done(inode, range_start, len, ret);
@@ -2078,11 +2078,11 @@ int nfs_migrate_folio(struct address_space *mapping, struct folio *dst,
 	 *        that we can safely release the inode reference while holding
 	 *        the folio lock.
 	 */
 	if (folio_test_private(src)) {
 		if (mode == MIGRATE_SYNC)
-			nfs_wb_folio(src->mapping->host, src);
+			nfs_wb_folio(src->mapping->host, src, true);
 		if (folio_test_private(src))
 			return -EBUSY;
 	}
 
 	if (folio_test_private_2(src)) { /* [DEPRECATED] */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a6624edb7226..295bc6214750 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -634,11 +634,11 @@ extern int  nfs_update_folio(struct file *file, struct folio *folio,
  * Try to write back everything synchronously (but check the
  * return value!)
  */
 extern int nfs_sync_inode(struct inode *inode);
 extern int nfs_wb_all(struct inode *inode);
-extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
+extern int nfs_wb_folio(struct inode *inode, struct folio *folio, bool sync);
 int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
 extern int  nfs_commit_inode(struct inode *, int);
 extern struct nfs_commit_data *nfs_commitdata_alloc(void);
 extern void nfs_commit_free(struct nfs_commit_data *data);
 void nfs_commit_begin(struct nfs_mds_commit_info *cinfo);
-- 
2.34.3

Re: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can deadlock state manager creation
Posted by Trond Myklebust 1 month, 1 week ago
On Tue, 2025-12-30 at 15:17 +0800, Wang Zhaolong wrote:
> Hi all,
> 
> I’d like to start an RFC discussion about a hung-task/deadlock that
> we hit in
> production-like testing on NFSv4.1 clients under server outage +
> memory
> pressure. The system becomes stuck even after the server/network is
> restored.
> 
> The scenario is:
> 
> - NFSv4.1 client running heavy multi-threaded buffered I/O (fio-style
> workload)
> - server outage (restart/power-off) and/or network blackhole
> - client under significant memory pressure / reclaim activity
> (observed in the
>   traces below)
> 
> The observed behavior is a deadlock cycle involving:
> 
> - v4.1 session slot table “draining” (NFS4_SLOT_TBL_DRAINING)
> - state manager thread creation via kthread_run()
> - kthreadd entering direct reclaim and getting stuck in NFS
> commit/writeback paths
> - non-privileged RPC tasks sleeping on slot table waitq due to
> draining
> 
> Below is the call-chain I reconstructed from traces (three key
> participants):
> 
> P1: sunrpc worker 1 (error handling triggers session recovery and
> tries to start the state manager)
> rpc_exit_task
>   nfs_writeback_done
>     nfs4_write_done
>       nfs4_sequence_done
>         nfs41_sequence_process
>           // status error, goto session_recover
>           set_bit(NFS4_SLOT_TBL_DRAINING, &session-
> >fc_slot_table.slot_tbl_state)  <1>
>           nfs4_schedule_session_recovery
>             nfs4_schedule_state_manager
>               kthread_run  // - Create a state manager thread to
> release the draining slots
>                 kthread_create_on_node
>                   __kthread_create_on_node
>                     wait_for_completion(&done);   <2> wait for <3>
> 
> P2: kthreadd (thread creation triggers reclaim; reclaim hits NFS
> folios and blocks in commit wait)
> kthreadd
>  kernel_thread
>   kernel_clone
>    copy_process
>     dup_task_struct
>      alloc_thread_stack_node
>       __vmalloc_node_range
>        __vmalloc_area_node
>         vm_area_alloc_pages
>          alloc_pages_bulk_array_mempolicy
>           __alloc_pages_bulk
>            __alloc_pages
>             __perform_reclaim
>              try_to_free_pages
>               do_try_to_free_pages
>                shrink_zones
>                 shrink_node
>                  shrink_node_memcgs
>                   shrink_lruvec
>                    shrink_inactive_list
>                     shrink_folio_list
>                      filemap_release_folio
>                       nfs_release_folio
>                        nfs_wb_folio
>                         folio PG_private !PG_writeback !PG_dirty
>                          nfs_commit_inode(inode, FLUSH_SYNC);
>                           __nfs_commit_inode
>                            nfs_generic_commit_list
>                             nfs_commit_list
>                              nfs_initiate_commit
>                               rpc_run_task  // Async task
>                            wait_on_commit            <3> wait for <4>
> 
> P3: sunrpc worker 2 (non-privileged tasks are blocked by draining)
> __rpc_execute
>   nfs4_setup_sequence
>     // if (nfs4_slot_tbl_draining(tbl) && !args->sa_privileged) goto
> sleep
>     rpc_sleep_on(&tbl->slot_tbl_waitq, task, NULL);    <4>  blocked
> by <1>
> 
> This forms a deadlock:
> 
> - <1> enables draining; non-privileged requests then block at <4>
> - recovery path attempts to create the state manager thread, but
>   blocks at <2> waiting for kthreadd
> - kthreadd is blocked at <3> waiting for commit progress /
> completion,
>   but commit/RPC progress is impeded because requests are stuck
> behind draining at <4>
> - once in this state, restoring the server/network does not resolve
> the deadlock
> 
> I suspect this deadlock became possible after the following mainline
> change that
> freezes the session table immediately on NFS4ERR_BADSESSION (and
> similar error paths):
> 
> c907e72f58ed ("NFSv4.1: freeze the session table upon receiving
> NFS4ERR_BADSESSION")
> 
> It sets NFS4_SLOT_TBL_DRAINING before the recovery thread runs:
> 
> Questions:
> 
> 1. Has anyone else observed a similar deadlock involving slot table
> draining + memory
>    reclaim? It looks like a similar issue might have been reported
> before — see
>    SUSE Bugzilla #1211527. [1]
> 2. Is it intended that kthreadd (or other critical kernel threads)
> may block in
>    nfs_commit_inode(FLUSH_SYNC) as part of reclaim?
> 3. Is there an established way to ensure recovery threads can always
> be created even
>    under severe memory pressure (e.g., reserve resources, GFP flags,
> or moving state
>    manager creation out of contexts that can trigger reclaim)?
> 
> I wrote a local patch purely as a discussion starter. I realize this
> approach is likely
> not the right solution upstream; I’m sharing it only to help reason
> about where the
> cycle can be broken. I can post the patch if people think it’s useful
> for the discussion.

I think we need to treat nfs_release_folio() as being different from
nfs_wb_folio() altogether. There are too many different ways in which
waiting on I/O can cause deadlocks, including waiting on the writeback
to complete. My suggestion is therefore that we just make this simple,
and see it as a hint that we should kick off either writeback or a
commit, but not that we should be waiting for it to complete.
So how about the following?

8<------------------------------------------------------------------
From 3533434037066b610d50e7bd36f3525ace296928 Mon Sep 17 00:00:00 2001
Message-ID: <3533434037066b610d50e7bd36f3525ace296928.1767204181.git.trond.myklebust@hammerspace.com>
From: Trond Myklebust <trond.myklebust@hammerspace.com>
Date: Wed, 31 Dec 2025 11:42:31 -0500
Subject: [PATCH] NFS: Fix a deadlock involving nfs_release_folio()

Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery
waiting on kthreadd, which is attempting to reclaim memory by calling
nfs_release_folio(). The latter cannot make progress due to state
recovery being needed.

It seems that the only safe thing to do here is to kick off a writeback
of the folio, without waiting for completion, or else kicking off an
asynchronous commit.

Reported-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/file.c          |  3 ++-
 fs/nfs/nfstrace.h      |  3 +++
 fs/nfs/write.c         | 32 ++++++++++++++++++++++++++++++++
 include/linux/nfs_fs.h |  1 +
 4 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d020aab40c64..d1c138a416cf 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -511,7 +511,8 @@ static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
 		if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
 		    current_is_kswapd() || current_is_kcompactd())
 			return false;
-		if (nfs_wb_folio(folio->mapping->host, folio) < 0)
+		if (nfs_wb_folio_reclaim(folio->mapping->host, folio) < 0 ||
+		    folio_test_private(folio))
 			return false;
 	}
 	return nfs_fscache_release_folio(folio, gfp);
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 6ce55e8e6b67..9f9ce4a565ea 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1062,6 +1062,9 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
 DEFINE_NFS_FOLIO_EVENT(nfs_aop_readpage);
 DEFINE_NFS_FOLIO_EVENT_DONE(nfs_aop_readpage_done);
 
+DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio_reclaim);
+DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_reclaim_done);
+
 DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio);
 DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 336c510f3750..5de9ec6c72a2 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2024,6 +2024,38 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
 	return ret;
 }
 
+/**
+ * nfs_wb_folio_reclaim - Write back all requests on one page
+ * @inode: pointer to page
+ * @folio: pointer to folio
+ *
+ * Assumes that the folio has been locked by the caller
+ */
+int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio)
+{
+	loff_t range_start = folio_pos(folio);
+	size_t len = folio_size(folio);
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_ALL,
+		.nr_to_write = 0,
+		.range_start = range_start,
+		.range_end = range_start + len - 1,
+		.for_sync = 1,
+	};
+	int ret;
+
+	if (folio_test_writeback(folio))
+		return -EBUSY;
+	if (folio_clear_dirty_for_io(folio)) {
+		trace_nfs_writeback_folio_reclaim(inode, range_start, len);
+		ret = nfs_writepage_locked(folio, &wbc);
+		trace_nfs_writeback_folio_reclaim_done(inode, range_start, len,
+						       ret);
+	} else
+		nfs_commit_inode(inode, 0);
+	return ret;
+}
+
 /**
  * nfs_wb_folio - Write back all requests on one page
  * @inode: pointer to page
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a6624edb7226..8dd79a3f3d66 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -637,6 +637,7 @@ extern int  nfs_update_folio(struct file *file, struct folio *folio,
 extern int nfs_sync_inode(struct inode *inode);
 extern int nfs_wb_all(struct inode *inode);
 extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
+extern int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio);
 int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
 extern int  nfs_commit_inode(struct inode *, int);
 extern struct nfs_commit_data *nfs_commitdata_alloc(void);
-- 
2.52.0
Re: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can deadlock state manager creation
Posted by Wang Zhaolong 1 month ago
Hi Trond,

Thanks a lot for taking the time to look into this and for the patch proposal.

> 
> I think we need to treat nfs_release_folio() as being different from
> nfs_wb_folio() altogether. There are too many different ways in which
> waiting on I/O can cause deadlocks, including waiting on the writeback
> to complete. My suggestion is therefore that we just make this simple,
> and see it as a hint that we should kick off either writeback or a
> commit, but not that we should be waiting for it to complete.
> So how about the following?
> 

I agree with the overall direction. Avoiding any synchronous waits from
->release_folio() makes a lot of sense.

> 8<------------------------------------------------------------------
>  From 3533434037066b610d50e7bd36f3525ace296928 Mon Sep 17 00:00:00 2001
> Message-ID: <3533434037066b610d50e7bd36f3525ace296928.1767204181.git.trond.myklebust@hammerspace.com>
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> Date: Wed, 31 Dec 2025 11:42:31 -0500
> Subject: [PATCH] NFS: Fix a deadlock involving nfs_release_folio()
> 
> Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery
> waiting on kthreadd, which is attempting to reclaim memory by calling
> nfs_release_folio(). The latter cannot make progress due to state
> recovery being needed.
> 
> It seems that the only safe thing to do here is to kick off a writeback
> of the folio, without waiting for completion, or else kicking off an
> asynchronous commit.
> 
> Reported-by: Wang Zhaolong <wangzhaolong@huaweicloud.com>
> Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page")
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>   fs/nfs/file.c          |  3 ++-
>   fs/nfs/nfstrace.h      |  3 +++
>   fs/nfs/write.c         | 32 ++++++++++++++++++++++++++++++++
>   include/linux/nfs_fs.h |  1 +
>   4 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index d020aab40c64..d1c138a416cf 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -511,7 +511,8 @@ static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
>   		if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
>   		    current_is_kswapd() || current_is_kcompactd())
>   			return false;
> -		if (nfs_wb_folio(folio->mapping->host, folio) < 0)
> +		if (nfs_wb_folio_reclaim(folio->mapping->host, folio) < 0 ||
> +		    folio_test_private(folio))
>   			return false;
>   	}
>   	return nfs_fscache_release_folio(folio, gfp);
> diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
> index 6ce55e8e6b67..9f9ce4a565ea 100644
> --- a/fs/nfs/nfstrace.h
> +++ b/fs/nfs/nfstrace.h
> @@ -1062,6 +1062,9 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
>   DEFINE_NFS_FOLIO_EVENT(nfs_aop_readpage);
>   DEFINE_NFS_FOLIO_EVENT_DONE(nfs_aop_readpage_done);
>   
> +DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio_reclaim);
> +DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_reclaim_done);
> +
>   DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio);
>   DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done);
>   
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 336c510f3750..5de9ec6c72a2 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -2024,6 +2024,38 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
>   	return ret;
>   }
>   
> +/**
> + * nfs_wb_folio_reclaim - Write back all requests on one page
> + * @inode: pointer to page
> + * @folio: pointer to folio
> + *
> + * Assumes that the folio has been locked by the caller
> + */
> +int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio)
> +{
> +	loff_t range_start = folio_pos(folio);
> +	size_t len = folio_size(folio);
> +	struct writeback_control wbc = {
> +		.sync_mode = WB_SYNC_ALL,
> +		.nr_to_write = 0,
> +		.range_start = range_start,
> +		.range_end = range_start + len - 1,
> +		.for_sync = 1,
> +	};
> +	int ret;

One small thing: in nfs_wb_folio_reclaim(), the else branch that calls
nfs_commit_inode(inode, 0) seems to return an uninitialized ret
(unless I missed something).

> +
> +	if (folio_test_writeback(folio))
> +		return -EBUSY;
> +	if (folio_clear_dirty_for_io(folio)) {
> +		trace_nfs_writeback_folio_reclaim(inode, range_start, len);
> +		ret = nfs_writepage_locked(folio, &wbc);
> +		trace_nfs_writeback_folio_reclaim_done(inode, range_start, len,
> +						       ret);
> +	} else
> +		nfs_commit_inode(inode, 0);

Maybe init ret = 0, or return the value from
nfs_commit_inode() for tracing.

> +	return ret;
> +}
> +
>   /**
>    * nfs_wb_folio - Write back all requests on one page
>    * @inode: pointer to page
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index a6624edb7226..8dd79a3f3d66 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -637,6 +637,7 @@ extern int  nfs_update_folio(struct file *file, struct folio *folio,
>   extern int nfs_sync_inode(struct inode *inode);
>   extern int nfs_wb_all(struct inode *inode);
>   extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
> +extern int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio);
>   int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
>   extern int  nfs_commit_inode(struct inode *, int);
>   extern struct nfs_commit_data *nfs_commitdata_alloc(void);

Happy to test this with my reproducer once you post it.

Thanks,
Wang Zhaolong