[v1] NFSD: restart ssc_expire_umount walk after dropping nfsd_ssc_lock

[PATCH] NFSD: restart ssc_expire_umount walk after dropping nfsd_ssc_lock

Posted by Michael Bommarito 1 day, 17 hours ago

nfsd4_ssc_expire_umount() walks nn->nfsd_ssc_mount_list with
list_for_each_entry_safe(ni, tmp, ...).  For each expired entry it
sets nsui_busy = true, drops nfsd_ssc_lock to run mntput() on the
source vfsmount, then reacquires the lock to list_del + kfree the
entry and continue iterating via the macro's saved tmp pointer.

The nsui_busy flag protects the current ni from concurrent
nfsd4_ssc_setup_dul() finders during the lock-drop window, but it
does not pin tmp.  Another nfsd RPC thread that fails its source-
server mount and reaches nfsd4_ssc_cancel_dul() will, during that
same window, take nfsd_ssc_lock, list_del + kfree its own ssc_umount
item, and release the lock.  If that item is the saved tmp of the
expire walk, the next iteration dereferences a freed
nfsd4_ssc_umount_item.

Reachability: triggered by any authenticated NFSv4.2 client that
can issue OP_COPY with cna_src.nl4_type = NL4_SERVER to a destination
nfsd built with CONFIG_NFSD_V4_2_INTER_SSC=y and started with
inter_copy_offload_enable=Y.  The client chooses the source-server
netaddr and can pick one that fails vfs_kern_mount() (unreachable,
RST after EXCHANGE_ID, etc.) to drive nfsd4_ssc_cancel_dul() into
the laundromat's lock-drop window.  Default Linux nfsd ships with
inter_copy_offload_enable=N, so the bug is reachable only on servers
where the administrator has explicitly opted into inter-SSC offload.

Restart the walk from the head after the mntput() unlock window so
no saved next pointer survives the lock-drop.  The list is bounded
by the number of active inter-server source mounts (typically small)
and the expire delayed-work runs periodically rather than per-IO,
so the restart is cheap.

Cc: stable@vger.kernel.org
Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
 fs/nfsd/nfs4state.c | 45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

Reproduced under QEMU/KVM with KASAN, three nfsd network namespaces
on a single host so the kernel client treats them as distinct
servers, and Linux fault injection forcing vfs_kern_mount allocations
inside the destination nfsd to fail.  This drives nfsd4_ssc_cancel_dul
into a tight loop concurrent with the laundromat workqueue.

Stock kernel:

  BUG: KASAN: slab-use-after-free in laundromat_main+0x1756/0x1be0
  Read of size 8 at addr ffff88800ce9b200 by task kworker/u16:3
  Workqueue: nfsd4 laundromat_main

  Allocated by task 229:
   nfsd4_interssc_connect+0x3f5/0xd90    (nfsd4_ssc_setup_dul, inlined)
   nfsd4_copy+0x117d/0x1a30
   nfsd4_proc_compound+0xbe9/0x23f0

  Freed by task 229:
   kfree+0x18f/0x520
   nfsd4_interssc_connect+0xaff/0xd90    (nfsd4_ssc_cancel_dul, inlined)
   nfsd4_copy+0x117d/0x1a30

  The buggy address belongs to the cache kmalloc-128 of size 128.
  Kernel panic - not syncing: Fatal exception

Patched kernel ran the equivalent workload to completion with the
inter-SSC code path exercised 21-22 times per run and no KASAN
report.

The fault-injection knobs are standard Linux testing infrastructure
(see Documentation/fault-injection/) exercising the existing failure
path in nfsd; no kernel source was modified.  The same primitive
class was previously addressed by the OPEN-error path fix in
__nfs42_ssc_open(); this patch closes the corresponding hole in
the laundromat-driven delayed-unmount path.


diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 6b9c399b89dfb..03582f15e3e7e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -6728,30 +6728,37 @@ static void nfsd4_ssc_shutdown_umount(struct nfsd_net *nn)
 static void nfsd4_ssc_expire_umount(struct nfsd_net *nn)
 {
 	bool do_wakeup = false;
-	struct nfsd4_ssc_umount_item *ni = NULL;
-	struct nfsd4_ssc_umount_item *tmp;
+	struct nfsd4_ssc_umount_item *ni;
 
+restart:
 	spin_lock(&nn->nfsd_ssc_lock);
-	list_for_each_entry_safe(ni, tmp, &nn->nfsd_ssc_mount_list, nsui_list) {
-		if (time_after(jiffies, ni->nsui_expire)) {
-			if (refcount_read(&ni->nsui_refcnt) > 1)
-				continue;
+	list_for_each_entry(ni, &nn->nfsd_ssc_mount_list, nsui_list) {
+		if (!time_after(jiffies, ni->nsui_expire))
+			break;
+		if (refcount_read(&ni->nsui_refcnt) > 1)
+			continue;
 
-			/* mark being unmount */
-			ni->nsui_busy = true;
-			spin_unlock(&nn->nfsd_ssc_lock);
-			mntput(ni->nsui_vfsmount);
-			spin_lock(&nn->nfsd_ssc_lock);
+		/* mark being unmount */
+		ni->nsui_busy = true;
+		spin_unlock(&nn->nfsd_ssc_lock);
+		mntput(ni->nsui_vfsmount);
+		spin_lock(&nn->nfsd_ssc_lock);
 
-			/* waiters need to start from begin of list */
-			list_del(&ni->nsui_list);
-			kfree(ni);
+		/* waiters need to start from begin of list */
+		list_del(&ni->nsui_list);
+		kfree(ni);
 
-			/* wakeup ssc_connect waiters */
-			do_wakeup = true;
-			continue;
-		}
-		break;
+		/* wakeup ssc_connect waiters */
+		do_wakeup = true;
+		/*
+		 * The list_for_each_entry_safe() saved-next pointer was
+		 * not pinned across the spin_unlock() above: a concurrent
+		 * nfsd4_ssc_cancel_dul() can free the next item under the
+		 * same spinlock while mntput() runs.  Restart the walk
+		 * from the head so no stale next is dereferenced.
+		 */
+		spin_unlock(&nn->nfsd_ssc_lock);
+		goto restart;
 	}
 	if (do_wakeup)
 		wake_up_all(&nn->nfsd_ssc_waitq);
-- 
2.53.0

Re: [PATCH] NFSD: restart ssc_expire_umount walk after dropping nfsd_ssc_lock

Posted by Chuck Lever 1 day, 3 hours ago

On Fri, May 22, 2026, at 9:41 PM, Michael Bommarito wrote:

> -			/* wakeup ssc_connect waiters */
> -			do_wakeup = true;
> -			continue;
> -		}
> -		break;
> +		/* wakeup ssc_connect waiters */
> +		do_wakeup = true;
> +		/*
> +		 * The list_for_each_entry_safe() saved-next pointer was
> +		 * not pinned across the spin_unlock() above: a concurrent
> +		 * nfsd4_ssc_cancel_dul() can free the next item under the
> +		 * same spinlock while mntput() runs.  Restart the walk
> +		 * from the head so no stale next is dereferenced.
> +		 */

Same observation as Jeff's:

The comment references list_for_each_entry_safe(), but the patched code
uses plain list_for_each_entry(). A reader looking only at the post-patch
source sees no _safe iterator anywhere in the function and has to
reconstruct the bug history to make sense of the comment.

The past-tense framing ("was not pinned") reinforces this: it describes a
removed code state rather than the current one.

Please send a v2 with this comment corrected.

> +		spin_unlock(&nn->nfsd_ssc_lock);
> +		goto restart;
>  	}
>  	if (do_wakeup)
>  		wake_up_all(&nn->nfsd_ssc_waitq);
> -- 
> 2.53.0

-- 
Chuck Lever

Re: [PATCH] NFSD: restart ssc_expire_umount walk after dropping nfsd_ssc_lock

Posted by Jeff Layton 1 day, 7 hours ago

On Fri, 2026-05-22 at 21:41 -0400, Michael Bommarito wrote:
> nfsd4_ssc_expire_umount() walks nn->nfsd_ssc_mount_list with
> list_for_each_entry_safe(ni, tmp, ...).  For each expired entry it
> sets nsui_busy = true, drops nfsd_ssc_lock to run mntput() on the
> source vfsmount, then reacquires the lock to list_del + kfree the
> entry and continue iterating via the macro's saved tmp pointer.
> 
> The nsui_busy flag protects the current ni from concurrent
> nfsd4_ssc_setup_dul() finders during the lock-drop window, but it
> does not pin tmp.  Another nfsd RPC thread that fails its source-
> server mount and reaches nfsd4_ssc_cancel_dul() will, during that
> same window, take nfsd_ssc_lock, list_del + kfree its own ssc_umount
> item, and release the lock.  If that item is the saved tmp of the
> expire walk, the next iteration dereferences a freed
> nfsd4_ssc_umount_item.
> 
> Reachability: triggered by any authenticated NFSv4.2 client that
> can issue OP_COPY with cna_src.nl4_type = NL4_SERVER to a destination
> nfsd built with CONFIG_NFSD_V4_2_INTER_SSC=y and started with
> inter_copy_offload_enable=Y.  The client chooses the source-server
> netaddr and can pick one that fails vfs_kern_mount() (unreachable,
> RST after EXCHANGE_ID, etc.) to drive nfsd4_ssc_cancel_dul() into
> the laundromat's lock-drop window.  Default Linux nfsd ships with
> inter_copy_offload_enable=N, so the bug is reachable only on servers
> where the administrator has explicitly opted into inter-SSC offload.
> 
> Restart the walk from the head after the mntput() unlock window so
> no saved next pointer survives the lock-drop.  The list is bounded
> by the number of active inter-server source mounts (typically small)
> and the expire delayed-work runs periodically rather than per-IO,
> so the restart is cheap.
> 
> Cc: stable@vger.kernel.org
> Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.")
> Assisted-by: Claude:claude-opus-4-7
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
>  fs/nfsd/nfs4state.c | 45 ++++++++++++++++++++++++++-------------------
>  1 file changed, 26 insertions(+), 19 deletions(-)
> 
> Reproduced under QEMU/KVM with KASAN, three nfsd network namespaces
> on a single host so the kernel client treats them as distinct
> servers, and Linux fault injection forcing vfs_kern_mount allocations
> inside the destination nfsd to fail.  This drives nfsd4_ssc_cancel_dul
> into a tight loop concurrent with the laundromat workqueue.
> 
> Stock kernel:
> 
>   BUG: KASAN: slab-use-after-free in laundromat_main+0x1756/0x1be0
>   Read of size 8 at addr ffff88800ce9b200 by task kworker/u16:3
>   Workqueue: nfsd4 laundromat_main
> 
>   Allocated by task 229:
>    nfsd4_interssc_connect+0x3f5/0xd90    (nfsd4_ssc_setup_dul, inlined)
>    nfsd4_copy+0x117d/0x1a30
>    nfsd4_proc_compound+0xbe9/0x23f0
> 
>   Freed by task 229:
>    kfree+0x18f/0x520
>    nfsd4_interssc_connect+0xaff/0xd90    (nfsd4_ssc_cancel_dul, inlined)
>    nfsd4_copy+0x117d/0x1a30
> 
>   The buggy address belongs to the cache kmalloc-128 of size 128.
>   Kernel panic - not syncing: Fatal exception
> 
> Patched kernel ran the equivalent workload to completion with the
> inter-SSC code path exercised 21-22 times per run and no KASAN
> report.
> 
> The fault-injection knobs are standard Linux testing infrastructure
> (see Documentation/fault-injection/) exercising the existing failure
> path in nfsd; no kernel source was modified.  The same primitive
> class was previously addressed by the OPEN-error path fix in
> __nfs42_ssc_open(); this patch closes the corresponding hole in
> the laundromat-driven delayed-unmount path.
> 
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 6b9c399b89dfb..03582f15e3e7e 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -6728,30 +6728,37 @@ static void nfsd4_ssc_shutdown_umount(struct nfsd_net *nn)
>  static void nfsd4_ssc_expire_umount(struct nfsd_net *nn)
>  {
>  	bool do_wakeup = false;
> -	struct nfsd4_ssc_umount_item *ni = NULL;
> -	struct nfsd4_ssc_umount_item *tmp;
> +	struct nfsd4_ssc_umount_item *ni;
>  
> +restart:
>  	spin_lock(&nn->nfsd_ssc_lock);
> -	list_for_each_entry_safe(ni, tmp, &nn->nfsd_ssc_mount_list, nsui_list) {
> -		if (time_after(jiffies, ni->nsui_expire)) {
> -			if (refcount_read(&ni->nsui_refcnt) > 1)
> -				continue;
> +	list_for_each_entry(ni, &nn->nfsd_ssc_mount_list, nsui_list) {
> +		if (!time_after(jiffies, ni->nsui_expire))
> +			break;
> +		if (refcount_read(&ni->nsui_refcnt) > 1)
> +			continue;
>  
> -			/* mark being unmount */
> -			ni->nsui_busy = true;
> -			spin_unlock(&nn->nfsd_ssc_lock);
> -			mntput(ni->nsui_vfsmount);
> -			spin_lock(&nn->nfsd_ssc_lock);
> +		/* mark being unmount */
> +		ni->nsui_busy = true;
> +		spin_unlock(&nn->nfsd_ssc_lock);
> +		mntput(ni->nsui_vfsmount);
> +		spin_lock(&nn->nfsd_ssc_lock);
>  
> -			/* waiters need to start from begin of list */
> -			list_del(&ni->nsui_list);
> -			kfree(ni);
> +		/* waiters need to start from begin of list */
> +		list_del(&ni->nsui_list);
> +		kfree(ni);
>  
> -			/* wakeup ssc_connect waiters */
> -			do_wakeup = true;
> -			continue;
> -		}
> -		break;
> +		/* wakeup ssc_connect waiters */
> +		do_wakeup = true;
> +		/*
> +		 * The list_for_each_entry_safe() saved-next pointer was

Comment is a bit confusing, given that you replaced
list_for_each_entry_safe() with list_for_each_entry().

> +		 * not pinned across the spin_unlock() above: a concurrent
> +		 * nfsd4_ssc_cancel_dul() can free the next item under the
> +		 * same spinlock while mntput() runs.  Restart the walk
> +		 * from the head so no stale next is dereferenced.
> +		 */
> +		spin_unlock(&nn->nfsd_ssc_lock);
> +		goto restart;
>  	}
>  	if (do_wakeup)
>  		wake_up_all(&nn->nfsd_ssc_waitq);

The bug and patch look correct to me though.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Re: [PATCH] NFSD: restart ssc_expire_umount walk after dropping nfsd_ssc_lock

Posted by Michael Bommarito 1 day, 7 hours ago

On Sat, May 23, 2026 at 6:55 AM Jeff Layton <jlayton@kernel.org> wrote:
> Comment is a bit confusing, given that you replaced
> list_for_each_entry_safe() with list_for_each_entry().

Sorry, that's left over from an earlier patch attempt that introduced
a different issue.  How would this comment look?

Concurrent nfsd4_ssc_cancel_dul() can free an item while spinlock is
dropped for mntput() above, so restart the walk from the head so no
stale pointer is followed.

Thanks,
Mike

Re: [PATCH] NFSD: restart ssc_expire_umount walk after dropping nfsd_ssc_lock

Posted by Jeff Layton 1 day, 7 hours ago

On Sat, 2026-05-23 at 07:02 -0400, Michael Bommarito wrote:
> On Sat, May 23, 2026 at 6:55 AM Jeff Layton <jlayton@kernel.org> wrote:
> > Comment is a bit confusing, given that you replaced
> > list_for_each_entry_safe() with list_for_each_entry().
> 
> Sorry, that's left over from an earlier patch attempt that introduced
> a different issue.  How would this comment look?
> 
> Concurrent nfsd4_ssc_cancel_dul() can free an item while spinlock is
> dropped for mntput() above, so restart the walk from the head so no
> stale pointer is followed.


Sure, looks good. FWIW, the "restart the loop after dropping the lock"
pattern is pretty common across the kernel, but it is good to lay out
the rationale in a comment.
-- 
Jeff Layton <jlayton@kernel.org>