[v2] pNFS: deadlock in pnfs_send_layoutreturn

[PATCH v2] pNFS: deadlock in pnfs_send_layoutreturn

Posted by Ben Roberts 2 months, 1 week ago

On an HPC cluster running 5.14.0-611.9.1.el9.x86_64, regular deadlocks were
seen within pnfs_send_layoutreturn leading to userspace processes stuck in
uninterruptible sleep, ultimately requiring reboots to clear. This was
occurring frequently, sometimes multiple times per day on specific hosts
with heavy load. Claude Code was tasked with hunting down any potential
deadlocks within pnfs_send_layoutreturn, and identified the following
condition. This patch has been running in production on top of the EL9
kernel for over three months without any recurrence of the deadlock.

The pnfs_send_layoutreturn() function can deadlock when memory
allocation fails. The issue occurs in the error path where
pnfs_put_layout_hdr() is called, which may trigger
pnfs_layoutreturn_before_put_layout_hdr(), potentially causing
a recursive call back to pnfs_send_layoutreturn().

Call chain that triggers the deadlock:
1. pnfs_send_layoutreturn() - kzalloc() fails
2. Error path calls pnfs_put_layout_hdr(lo)
3. pnfs_put_layout_hdr() calls pnfs_layoutreturn_before_put_layout_hdr()
4. If NFS_LAYOUT_RETURN_REQUESTED is still set, attempts another
   layoutreturn, creating recursion/deadlock

The fix ensures that NFS_LAYOUT_RETURN_REQUESTED is cleared in the
allocation failure path before calling pnfs_put_layout_hdr(). This
prevents pnfs_layoutreturn_before_put_layout_hdr() from attempting
another layout return, breaking the recursion cycle.

v2 fixes a syntax error introduced while composing the original mail.

Signed-off-by: Ben Roberts <ben.roberts@gsacapital.com>
Assisted-by: Claude:claude-sonnet-4-5
---
 fs/nfs/pnfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index bc13d1e69449..47bda53b2b3a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1361,6 +1361,7 @@ pnfs_send_layoutreturn(struct pnfs_layout_hdr *lo,
 	if (unlikely(lrp == NULL)) {
 		status = -ENOMEM;
 		spin_lock(&ino->i_lock);
+		pnfs_clear_layoutreturn_info(lo);
 		pnfs_clear_layoutreturn_waitbit(lo);
 		spin_unlock(&ino->i_lock);
 		put_cred(cred);
--
2.43.0

For details of how GSA uses your personal information, please see our Privacy Notice here: https://www.gsacapital.com/privacy-notice 

This email and any files transmitted with it contain confidential and proprietary information and is solely for the use of the intended recipient.
If you are not the intended recipient please return the email to the sender and delete it from your computer and you must not use, disclose, distribute, copy, print or rely on this email or its contents.
This communication is for informational purposes only.
It is not intended as an offer or solicitation for the purchase or sale of any financial instrument or as an official confirmation of any transaction.
Any comments or statements made herein do not necessarily reflect those of GSA Capital.
GSA Capital Partners LLP is authorised and regulated by the Financial Conduct Authority and is registered in England and Wales at Stratton House, 5 Stratton Street, London W1J 8LA, number OC309261.
GSA Capital Services Limited is registered in England and Wales at the same address, number 5320529.

Re: [PATCH v2] pNFS: deadlock in pnfs_send_layoutreturn

Posted by Benjamin Coddington 2 months ago

Hi Ben,

Did you reproduce and diagnose this problem on a recent upstream kernel
version?  If not, you probably want to report the issue to your Red Hat
support channel directly instead of sending fixes upstream because the
maintainers will want to see patches and fixes against the upstream code.
The code in 5.14.0-611.9.1.el9 may have different behaviors.

Ben

On 8 Apr 2026, at 8:25, Ben Roberts wrote:

> On an HPC cluster running 5.14.0-611.9.1.el9.x86_64, regular deadlocks were
> seen within pnfs_send_layoutreturn leading to userspace processes stuck in
> uninterruptible sleep, ultimately requiring reboots to clear. This was
> occurring frequently, sometimes multiple times per day on specific hosts
> with heavy load. Claude Code was tasked with hunting down any potential
> deadlocks within pnfs_send_layoutreturn, and identified the following
> condition. This patch has been running in production on top of the EL9
> kernel for over three months without any recurrence of the deadlock.
>
> The pnfs_send_layoutreturn() function can deadlock when memory
> allocation fails. The issue occurs in the error path where
> pnfs_put_layout_hdr() is called, which may trigger
> pnfs_layoutreturn_before_put_layout_hdr(), potentially causing
> a recursive call back to pnfs_send_layoutreturn().
>
> Call chain that triggers the deadlock:
> 1. pnfs_send_layoutreturn() - kzalloc() fails
> 2. Error path calls pnfs_put_layout_hdr(lo)
> 3. pnfs_put_layout_hdr() calls pnfs_layoutreturn_before_put_layout_hdr()
> 4. If NFS_LAYOUT_RETURN_REQUESTED is still set, attempts another
>    layoutreturn, creating recursion/deadlock
>
> The fix ensures that NFS_LAYOUT_RETURN_REQUESTED is cleared in the
> allocation failure path before calling pnfs_put_layout_hdr(). This
> prevents pnfs_layoutreturn_before_put_layout_hdr() from attempting
> another layout return, breaking the recursion cycle.
>
> v2 fixes a syntax error introduced while composing the original mail.
>
> Signed-off-by: Ben Roberts <ben.roberts@gsacapital.com>
> Assisted-by: Claude:claude-sonnet-4-5
> ---
>  fs/nfs/pnfs.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index bc13d1e69449..47bda53b2b3a 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1361,6 +1361,7 @@ pnfs_send_layoutreturn(struct pnfs_layout_hdr *lo,
>  	if (unlikely(lrp == NULL)) {
>  		status = -ENOMEM;
>  		spin_lock(&ino->i_lock);
> +		pnfs_clear_layoutreturn_info(lo);
>  		pnfs_clear_layoutreturn_waitbit(lo);
>  		spin_unlock(&ino->i_lock);
>  		put_cred(cred);
> --
> 2.43.0

RE: [PATCH v2] pNFS: deadlock in pnfs_send_layoutreturn

Posted by Ben Roberts 1 month, 4 weeks ago

Hi Ben,

> Did you reproduce and diagnose this problem on a recent upstream kernel
> version? 

I'm not able to switch the production systems to a more recent kernel at
this time, and don't have a reliable way to reproduce the issue in the
wild without putting production systems back on an unpatched kernel. The
best evidence I have that this patch is needed is that we were seeing
this deadlock occur repeatedly under high loads and memory pressure last
year before applying this patch locally. It was rolled out to all
production systems in early January and we have not seen a single
recurrence since. The relevant code paths look similar between the modified
EL9 kernel and the current git HEAD.

I've spent all this week trying to devise a precise reproduction (with a
lot of help from an LLM since I'm not that familiar with kernel
development) on both 7.0.0-rc7 (9a9c8ce300cd) and 5.14.0-611.9.1 kernels
to definitively prove this patch is needed, but without success. The
initial analysis suggested the deadlock might be triggered from a single
process via a recursive call. This theory has been ruled out; all calling
paths triggered by a single process are guarded in such a way that a
recursive call does not happen. The revised theory for why this patch
helps is that the deadlock is subject to an inter-process race, where one
process suffers a memory allocation failure, causing a second process to
become a victim of bad state, leading to an unserviceable RPC request.
The timing appears to be very sensitive and hard to reproduce on demand.

The LLM-assisted analysis follows, and to my non-expert eyes seems
compelling.

Thanks,
Ben

---

When pnfs_send_layoutreturn() fails to allocate memory for the layoutreturn
structure, it returns -ENOMEM after calling pnfs_clear_layoutreturn_waitbit
to wake waiting processes. However, it fails to clear the
NFS_LAYOUT_RETURN_REQUESTED flag via pnfs_clear_layoutreturn_info().

This creates a race condition where processes woken by
pnfs_clear_layoutreturn_waitbit will see NFS_LAYOUT_RETURN_REQUESTED still
set and may attempt another layoutreturn on the corrupted layout state,
leading to hung tasks in rpc_wait_bit_killable().

The bug manifests during memory pressure when:
1. Process A calls pnfs_send_layoutreturn(), kzalloc fails
2. Error path calls pnfs_clear_layoutreturn_waitbit() (wakes waiters)
3. Error path does NOT call pnfs_clear_layoutreturn_info() (flag remains
   set)
4. Process B wakes from wait_on_bit() at _pnfs_return_layout:1484
5. Process B calls pnfs_put_layout_hdr(), refcount reaches zero
6. pnfs_layoutreturn_before_put_layout_hdr() checks flag at line 1400
7. Flag is still set, so Process B calls pnfs_send_layoutreturn() (async)
8. RPC operates on inconsistent/dying layout state
9. Process B hangs indefinitely in rpc_wait_bit_killable()

The fix ensures that when allocation fails in the error path, both the
waitbit AND the layoutreturn info are cleared, so waiting processes wake
to consistent state and do not attempt operations on the corrupted layout.

  The Deadlock Mechanism

  Thread A (allocation failure):
  // Lines 1360-1368 in pnfs_send_layoutreturn()
  if (unlikely(lrp == NULL)) {
      status = -ENOMEM;
      spin_lock(&ino->i_lock);
      pnfs_clear_layoutreturn_waitbit(lo);  // Wakes Thread B
      spin_unlock(&ino->i_lock);
      // BUG: NFS_LAYOUT_RETURN_REQUESTED still set!
      put_cred(cred);
      pnfs_put_layout_hdr(lo);  // Decrements refcount
      goto out;
  }

  Thread B (wakes up):
  // In _pnfs_return_layout() around line 1458
  if (test_bit(NFS_LAYOUT_RETURN_LOCK, &lo->plh_flags)) {
      spin_unlock(&ino->i_lock);
      if (wait_on_bit(...))  // Thread B was blocked HERE
          goto out_put_layout_hdr;  // Now wakes up and goes here
      spin_lock(&ino->i_lock);
  }
  // ... continues ...
  wait_on_bit(...);  // Line 1484 - passes through (bit already clear)
  out_put_layout_hdr:
  pnfs_free_lseg_list(&tmp_list);
  pnfs_put_layout_hdr(lo);  // Line 1487 - THE CRITICAL CALL

  Inside Thread B's pnfs_put_layout_hdr() call (line 306):
  void pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo)
  {
      inode = lo->plh_inode;
      pnfs_layoutreturn_before_put_layout_hdr(lo);  // 313 Called FIRST

      if (refcount_dec_and_lock(&lo->plh_refcount, &inode->i_lock)) {
          // If refcount reaches 0, layout is FREED here
          pnfs_detach_layout_hdr(lo);
          pnfs_free_layout_hdr(lo);  // Layout memory FREED!
      }
  }

  Inside pnfs_layoutreturn_before_put_layout_hdr() (line 1395):
  static void
  pnfs_layoutreturn_before_put_layout_hdr(struct pnfs_layout_hdr *lo)
  {
      if (!test_bit(NFS_LAYOUT_RETURN_REQUESTED, &lo->plh_flags))
          return;  // Line 1399-1400 Would early return if flag was clear

      // BUG: Flag is STILL SET, so continues!
      spin_lock(&inode->i_lock);
      if (pnfs_layout_need_return(lo)) {
          ...
          send = pnfs_prepare_layoutreturn(lo, &stateid, &cred, &iomode);
          spin_unlock(&inode->i_lock);
          if (send) {
              // Sends ANOTHER layoutreturn RPC!
              pnfs_send_layoutreturn(lo, &stateid, &cred, iomode,
                                     PNFS_FL_LAYOUTRETURN_ASYNC);  // 1412
          }
      }
  }

  Why It Deadlocks

  The key issue is timing and state corruption:

  1. Thread B initiates async RPC at line 1412 that references layout lo
  2. Then Thread B returns to pnfs_put_layout_hdr() line 315
  3. If refcount reaches 0: Layout is DETACHED and FREED (lines 318-323)
  4. But the RPC from step 1 is still in flight!

  The RPC is now operating on a layout structure that may be:
  - Partially freed
  - Corrupted
  - Have invalid stateid
  - Have detached inode reference

  When the RPC machinery tries to complete this operation, it hits
  inconsistent state and hangs in rpc_wait_bit_killable() waiting for an
  RPC response that can never complete properly because the underlying
  state is corrupted.

---

Example stack trace for a deadlocked process:

[<0>] rpc_wait_bit_killable+0xd/0x60 [sunrpc]
[<0>] __rpc_execute+0x13a/0x300 [sunrpc]
[<0>] rpc_execute+0xc5/0xf0 [sunrpc]
[<0>] rpc_run_task+0x14d/0x1c0 [sunrpc]
[<0>] nfs4_proc_layoutreturn+0x14f/0x270 [nfsv4]
[<0>] pnfs_send_layoutreturn+0x119/0x190 [nfsv4]
[<0>] _pnfs_return_layout+0x1b6/0x280 [nfsv4]
[<0>] nfs4_evict_inode+0x6d/0x70 [nfsv4]
[<0>] evict+0xcc/0x1d0
[<0>] dispose_list+0x48/0x70
[<0>] evict_inodes+0x1a0/0x1b0
[<0>] generic_shutdown_super+0x37/0x100
[<0>] kill_anon_super+0x12/0x40
[<0>] nfs_kill_super+0x22/0x40 [nfs]
[<0>] deactivate_locked_super+0x2e/0xb0
[<0>] cleanup_mnt+0x100/0x160
[<0>] task_work_run+0x59/0x90
[<0>] do_exit+0x264/0x480
[<0>] do_group_exit+0x2d/0x90
[<0>] get_signal+0x839/0x860
[<0>] arch_do_signal_or_restart+0x25/0x100
[<0>] exit_to_user_mode_loop+0x9c/0x130
[<0>] exit_to_user_mode_prepare+0xb9/0x100
[<0>] syscall_exit_to_user_mode+0x12/0x40
[<0>] do_syscall_64+0x6b/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80

Ben Roberts 

Re: [PATCH v2] pNFS: deadlock in pnfs_send_layoutreturn

Posted by Benjamin Coddington 1 month, 4 weeks ago

On 17 Apr 2026, at 3:17, Ben Roberts wrote:

> Hi Ben,
>
>> Did you reproduce and diagnose this problem on a recent upstream kernel
>> version?
>
> I'm not able to switch the production systems to a more recent kernel at
> this time, and don't have a reliable way to reproduce the issue in the
> wild without putting production systems back on an unpatched kernel. The
> best evidence I have that this patch is needed is that we were seeing
> this deadlock occur repeatedly under high loads and memory pressure last
> year before applying this patch locally. It was rolled out to all
> production systems in early January and we have not seen a single
> recurrence since. The relevant code paths look similar between the modified
> EL9 kernel and the current git HEAD.
>
> I've spent all this week trying to devise a precise reproduction (with a
> lot of help from an LLM since I'm not that familiar with kernel
> development) on both 7.0.0-rc7 (9a9c8ce300cd) and 5.14.0-611.9.1 kernels
> to definitively prove this patch is needed, but without success. The
> initial analysis suggested the deadlock might be triggered from a single
> process via a recursive call. This theory has been ruled out; all calling
> paths triggered by a single process are guarded in such a way that a
> recursive call does not happen. The revised theory for why this patch
> helps is that the deadlock is subject to an inter-process race, where one
> process suffers a memory allocation failure, causing a second process to
> become a victim of bad state, leading to an unserviceable RPC request.
> The timing appears to be very sensitive and hard to reproduce on demand.
>
> The LLM-assisted analysis follows, and to my non-expert eyes seems
> compelling.
>
> Thanks,
> Ben

The kzalloc failure is definitely a rarely-used/tested path, so it's possible
there's an issue there no one has seen yet, but from what I can see it looks
like every call to pnfs_send_layoutreturn() first calls
pnfs_prepare_layoutreturn(), which already clears
NFS_LAYOUT_RETURN_REQUESTED.  I don't see how you can end up with another
process seeing the flag.
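
The ordering described above can be sketched as a small userspace model.
This is an illustration only, not the kernel code: the model_* names and the
two-bit flag word are hypothetical stand-ins for NFS_LAYOUT_RETURN_LOCK and
NFS_LAYOUT_RETURN_REQUESTED, and it assumes, as stated above, that the
prepare step always runs before the send step.

```c
/* Userspace model only; every model_* / MODEL_* name is hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum {
	MODEL_RETURN_LOCK      = 1u << 0, /* stands in for NFS_LAYOUT_RETURN_LOCK */
	MODEL_RETURN_REQUESTED = 1u << 1, /* stands in for NFS_LAYOUT_RETURN_REQUESTED */
};

struct model_layout {
	unsigned int flags;
};

/* Models the prepare step: take the return lock and consume the request flag. */
static bool model_prepare_layoutreturn(struct model_layout *lo)
{
	if (lo->flags & MODEL_RETURN_LOCK)
		return false;	/* another return is already in flight */
	lo->flags |= MODEL_RETURN_LOCK;
	lo->flags &= ~MODEL_RETURN_REQUESTED;
	return true;
}

/* Models the send step; alloc_ok == false simulates the kzalloc() failure. */
static int model_send_layoutreturn(struct model_layout *lo, bool alloc_ok)
{
	/* Assumption of the model: send is only reached after a successful prepare. */
	assert(lo->flags & MODEL_RETURN_LOCK);

	if (!alloc_ok) {
		/* Error path: drop the lock bit (waiters would be woken here). */
		lo->flags &= ~MODEL_RETURN_LOCK;
		return -ENOMEM;
	}
	return 0;	/* lock stays held until the modelled RPC completes */
}

/* What a second task would test before attempting its own layoutreturn. */
static bool model_return_requested(const struct model_layout *lo)
{
	return (lo->flags & MODEL_RETURN_REQUESTED) != 0;
}
```

In this model, a caller that runs model_prepare_layoutreturn() and then
model_send_layoutreturn(lo, false) is left with both bits clear, so a second
task that tests the request flag afterwards would not see it set.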

There's at least one body of work in this area that your systems
don't yet have:
https://lore.kernel.org/linux-nfs/20240613050055.854323-1-trond.myklebust@hammerspace.com/

Again, I strongly recommend engaging RH support here.
Ben

RE: [PATCH v2] pNFS: deadlock in pnfs_send_layoutreturn

Posted by Ben Roberts 1 month, 3 weeks ago

Hi Ben,

> The kzalloc failure is definitely a rarely-used/tested path, so it's possible
> there's an issue there no one has seen yet, but from what I can see it looks
> like every call to pnfs_send_layoutreturn() first calls
> pnfs_prepare_layoutreturn(), which already clears
> NFS_LAYOUT_RETURN_REQUESTED. I don't see how you can end up with another
> process seeing the flag.

Understood, thank you for the feedback here. I withdraw the patch request and
will take this back to the drawing board. I will likely remove the patch from
local systems and look to capture more useful diagnostics if/when the problem
recurs.

> There's at least one body of work in this area that your systems
> don't yet have:
> https://lore.kernel.org/linux-nfs/20240613050055.854323-1-trond.myklebust@hammerspace.com/

For what it's worth, I checked this patch set and the changes are almost
completely present in the EL source. Patch 2 is only partially applied. 

Ben Roberts

Re: [PATCH v2] pNFS: deadlock in pnfs_send_layoutreturn

Posted by Trond Myklebust 1 month, 3 weeks ago

On Wed, 2026-04-22 at 10:46 +0000, Ben Roberts wrote:
> Hi Ben,
> 
> > The kzalloc failure is definitely a rarely-used/tested path, so it's
> > possible there's an issue there no one has seen yet, but from what I
> > can see it looks like every call to pnfs_send_layoutreturn() first
> > calls pnfs_prepare_layoutreturn(), which already clears
> > NFS_LAYOUT_RETURN_REQUESTED. I don't see how you can end up with
> > another process seeing the flag.
> 

I had first applied this patch; however, when looking more closely, it
seems that it will also be defeated by the loop in
pnfs_clear_layoutreturn_info() that checks if there are any layouts
that need returning, and sets NFS_LAYOUT_RETURN_REQUESTED again.
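
The loop mentioned above can be modelled in a few lines of userspace C.
This is a sketch under assumptions, not the kernel implementation:
struct model_hdr, struct model_lseg and model_clear_layoutreturn_info() are
hypothetical names, and a plain boolean stands in for each flag bit.

```c
/* Userspace model only; all names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_lseg {
	bool needs_return;	/* stands in for a per-segment "return me" flag */
};

struct model_hdr {
	bool return_requested;	/* stands in for NFS_LAYOUT_RETURN_REQUESTED */
	struct model_lseg *segs;
	size_t nsegs;
};

/*
 * Clear the request flag, then walk the segments: if any segment still
 * needs returning, the flag is set again before the function returns.
 */
static void model_clear_layoutreturn_info(struct model_hdr *lo)
{
	size_t i;

	assert(lo->segs != NULL || lo->nsegs == 0);

	lo->return_requested = false;
	for (i = 0; i < lo->nsegs; i++) {
		if (lo->segs[i].needs_return)
			lo->return_requested = true;
	}
}
```

In this model, clearing the request flag only sticks once no segment is
still marked as needing return, which is why an extra call in the allocation
failure path would not by itself guarantee that the flag stays clear.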

> Understood, thank you for the feedback here. I withdraw the patch
> request and will take this back to the drawing board. Likely removing
> the patch from local systems and looking to capture more useful
> diagnostics if/when the problem recurs.
> 
> > There's at least one body of work in this area that your systems
> > don't yet have:
> > https://lore.kernel.org/linux-nfs/20240613050055.854323-1-trond.myklebust@hammerspace.com/
> 
> For what it's worth, I checked this patch set and the changes are
> almost completely present in the EL source. Patch 2 is only partially
> applied.
> 
> Ben Roberts