[PATCH] lockd: pin next file across nlm_inspect_file lock-drop

Michael Bommarito posted 1 patch 1 day, 16 hours ago
There is a newer version of this series
fs/lockd/svcsubs.c | 61 +++++++++++++++++++++++++++++++---------------
1 file changed, 42 insertions(+), 19 deletions(-)
[PATCH] lockd: pin next file across nlm_inspect_file lock-drop
Posted by Michael Bommarito 1 day, 16 hours ago
nlm_traverse_files() walks each nlm_files[] hash bucket with
hlist_for_each_entry_safe(file, next, ...). For each matching file
it bumps f_count, drops nlm_file_mutex to run nlm_inspect_file()
(which may sleep walking blocks, shares, and the inode lock list),
then reacquires the mutex and decrements f_count before continuing
to the saved next.

The f_count bump pins the current file across the lock-drop, but
nothing pins next. Any nlmsvc thread that holds the last reference
on the file at next will, during that window, call
nlm_release_file() -> nlm_delete_file() under nlm_file_mutex,
hlist_del() it from the bucket, and kfree() it. When
nlm_traverse_files() reacquires the mutex and the macro reads the
next entry's f_list.next on the following iteration, the read lands
in the freed slab.

A naive restart-on-action variant would deadlock-spin against an
nlm_release_file holder: nlm_inspect_file() does not always drain
the file (it can return 1 with an RPC still holding f_count above
the cleanup threshold), and the outer predicate is_failover_file()
matches static attributes of the file, so a restart can keep
re-finding the same un-cleanable file until the external RPC ref
drops.

Pin the neighbour explicitly instead. Walk the bucket with two
locally-pinned cursors at a time: file (current, pinned by the
prior iteration's next bump) and next (one ahead). Drop file's pin
at the end of each iteration, then advance to next, which is still
alive because we hold its f_count above zero across the unlock.
This bounds the walk at O(N) per bucket and never observes a freed
neighbour. Factor the f_count/list/share/lock cleanup into a
helper so the no-match path also drops a stale empty file rather
than leaving it in the table.

Cc: stable@vger.kernel.org
Fixes: 01df9c5e918a ("LOCKD: Fix a deadlock in nlm_traverse_files()")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
 fs/lockd/svcsubs.c | 61 +++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 19 deletions(-)

Reproduced under UML + KASAN with a loopback NFSv3 mount, 768
concurrent POSIX fcntl(F_SETLKW) holders, and parallel writes to
/proc/fs/nfsd/unlock_filesystem forcing nlmsvc_unlock_all_by_sb()
to walk the table while clients churn locks.

Stock kernel:

  BUG: KASAN: slab-use-after-free in nlm_traverse_files+0x71d/0x9d0
  Read of size 8 at addr 0000000070314800 by task nlm-init-...

  Allocated by: nlm_lookup_file via nlm4svc_proc_lock
  Freed by:     another nlm_traverse_files instance freeing a
                file whose f_count dropped to zero during the
                nlm_inspect_file() unlock window

Patched UML kernel ran the same harness silently.

Pin-next was chosen over restart-on-action because the latter can
livelock when nlm_inspect_file() returns 1 with an RPC reference
still holding the file above the cleanup threshold and the outer
is_failover_file() predicate matching static attributes.


diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
index dd0214dcb6950..2bfa32207f10c 100644
--- a/fs/lockd/svcsubs.c
+++ b/fs/lockd/svcsubs.c
@@ -295,36 +295,59 @@ static void nlm_close_files(struct nlm_file *file)
 /*
  * Loop over all files in the file table.
  */
+static void nlm_file_release(struct nlm_file *file)
+{
+	if (list_empty(&file->f_blocks) && !file->f_locks
+	    && !file->f_shares && !file->f_count) {
+		hlist_del(&file->f_list);
+		nlm_close_files(file);
+		kfree(file);
+	}
+}
+
 static int
 nlm_traverse_files(void *data, nlm_host_match_fn_t match,
 		int (*is_failover_file)(void *data, struct nlm_file *file))
 {
-	struct hlist_node *next;
-	struct nlm_file	*file;
+	struct nlm_file *file, *next;
 	int i, ret = 0;
 
 	mutex_lock(&nlm_file_mutex);
 	for (i = 0; i < FILE_NRHASH; i++) {
-		hlist_for_each_entry_safe(file, next, &nlm_files[i], f_list) {
-			if (is_failover_file && !is_failover_file(data, file))
-				continue;
+		file = hlist_entry_safe(nlm_files[i].first,
+					struct nlm_file, f_list);
+		if (file)
 			file->f_count++;
-			mutex_unlock(&nlm_file_mutex);
-
-			/* Traverse locks, blocks and shares of this file
-			 * and update file->f_locks count */
-			if (nlm_inspect_file(data, file, match))
-				ret = 1;
+		while (file) {
+			/*
+			 * Pin the next neighbour before we drop the mutex
+			 * for nlm_inspect_file(); a concurrent
+			 * nlm_release_file() under the same mutex would
+			 * otherwise be free to unlink and kfree it during
+			 * the unlock window, leaving us to dereference a
+			 * freed slab when we walked to next afterwards.
+			 */
+			next = hlist_entry_safe(file->f_list.next,
+						struct nlm_file, f_list);
+			if (next)
+				next->f_count++;
+
+			if (!is_failover_file || is_failover_file(data, file)) {
+				mutex_unlock(&nlm_file_mutex);
+
+				/*
+				 * Traverse locks, blocks and shares of this
+				 * file and update file->f_locks count.
+				 */
+				if (nlm_inspect_file(data, file, match))
+					ret = 1;
+
+				mutex_lock(&nlm_file_mutex);
+			}
 
-			mutex_lock(&nlm_file_mutex);
 			file->f_count--;
-			/* No more references to this file. Let go of it. */
-			if (list_empty(&file->f_blocks) && !file->f_locks
-			 && !file->f_shares && !file->f_count) {
-				hlist_del(&file->f_list);
-				nlm_close_files(file);
-				kfree(file);
-			}
+			nlm_file_release(file);
+			file = next;
 		}
 	}
 	mutex_unlock(&nlm_file_mutex);
-- 
2.53.0
Re: [PATCH] lockd: pin next file across nlm_inspect_file lock-drop
Posted by Jeff Layton 1 day, 7 hours ago
On Fri, 2026-05-22 at 21:42 -0400, Michael Bommarito wrote:
> nlm_traverse_files() walks each nlm_files[] hash bucket with
> hlist_for_each_entry_safe(file, next, ...). For each matching file
> it bumps f_count, drops nlm_file_mutex to run nlm_inspect_file()
> (which may sleep walking blocks, shares, and the inode lock list),
> then reacquires the mutex and decrements f_count before continuing
> to the saved next.
> 
> The f_count bump pins the current file across the lock-drop, but
> nothing pins next. Any nlmsvc thread that holds the last reference
> on the file at next will, during that window, call
> nlm_release_file() -> nlm_delete_file() under nlm_file_mutex,
> hlist_del() it from the bucket, and kfree() it. When
> nlm_traverse_files() reacquires the mutex and the macro reads the
> next entry's f_list.next on the following iteration, the read lands
> in the freed slab.
> 
> A naive restart-on-action variant would deadlock-spin against an
> nlm_release_file holder: nlm_inspect_file() does not always drain
> the file (it can return 1 with an RPC still holding f_count above
> the cleanup threshold), and the outer predicate is_failover_file()
> matches static attributes of the file, so a restart can keep
> re-finding the same un-cleanable file until the external RPC ref
> drops.
> 
> Pin the neighbour explicitly instead. Walk the bucket with two
> locally-pinned cursors at a time: file (current, pinned by the
> prior iteration's next bump) and next (one ahead). Drop file's pin
> at the end of each iteration, then advance to next, which is still
> alive because we hold its f_count above zero across the unlock.
> This bounds the walk at O(N) per bucket and never observes a freed
> neighbour. Factor the f_count/list/share/lock cleanup into a
> helper so the no-match path also drops a stale empty file rather
> than leaving it in the table.
> 
> Cc: stable@vger.kernel.org
> Fixes: 01df9c5e918a ("LOCKD: Fix a deadlock in nlm_traverse_files()")
> Assisted-by: Claude:claude-opus-4-7
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
>  fs/lockd/svcsubs.c | 61 +++++++++++++++++++++++++++++++---------------
>  1 file changed, 42 insertions(+), 19 deletions(-)
> 
> Reproduced under UML + KASAN with a loopback NFSv3 mount, 768
> concurrent POSIX fcntl(F_SETLKW) holders, and parallel writes to
> /proc/fs/nfsd/unlock_filesystem forcing nlmsvc_unlock_all_by_sb()
> to walk the table while clients churn locks.
> 
> Stock kernel:
> 
>   BUG: KASAN: slab-use-after-free in nlm_traverse_files+0x71d/0x9d0
>   Read of size 8 at addr 0000000070314800 by task nlm-init-...
> 
>   Allocated by: nlm_lookup_file via nlm4svc_proc_lock
>   Freed by:     another nlm_traverse_files instance freeing a
>                 file whose f_count dropped to zero during the
>                 nlm_inspect_file() unlock window
> 
> Patched UML kernel ran the same harness silently.
> 
> Pin-next was chosen over restart-on-action because the latter can
> livelock when nlm_inspect_file() returns 1 with an RPC reference
> still holding the file above the cleanup threshold and the outer
> is_failover_file() predicate matching static attributes.
> 
> 
> diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> index dd0214dcb6950..2bfa32207f10c 100644
> --- a/fs/lockd/svcsubs.c
> +++ b/fs/lockd/svcsubs.c
> @@ -295,36 +295,59 @@ static void nlm_close_files(struct nlm_file *file)
>  /*
>   * Loop over all files in the file table.
>   */
> +static void nlm_file_release(struct nlm_file *file)
> +{
> +	if (list_empty(&file->f_blocks) && !file->f_locks
> +	    && !file->f_shares && !file->f_count) {
> +		hlist_del(&file->f_list);
> +		nlm_close_files(file);
> +		kfree(file);
> +	}
> +}
> +
>  static int
>  nlm_traverse_files(void *data, nlm_host_match_fn_t match,
>  		int (*is_failover_file)(void *data, struct nlm_file *file))
>  {
> -	struct hlist_node *next;
> -	struct nlm_file	*file;
> +	struct nlm_file *file, *next;
>  	int i, ret = 0;
>  
>  	mutex_lock(&nlm_file_mutex);
>  	for (i = 0; i < FILE_NRHASH; i++) {
> -		hlist_for_each_entry_safe(file, next, &nlm_files[i], f_list) {
> -			if (is_failover_file && !is_failover_file(data, file))
> -				continue;
> +		file = hlist_entry_safe(nlm_files[i].first,
> +					struct nlm_file, f_list);
> +		if (file)
>  			file->f_count++;
> -			mutex_unlock(&nlm_file_mutex);
> -
> -			/* Traverse locks, blocks and shares of this file
> -			 * and update file->f_locks count */
> -			if (nlm_inspect_file(data, file, match))
> -				ret = 1;
> +		while (file) {
> +			/*
> +			 * Pin the next neighbour before we drop the mutex
> +			 * for nlm_inspect_file(); a concurrent
> +			 * nlm_release_file() under the same mutex would
> +			 * otherwise be free to unlink and kfree it during
> +			 * the unlock window, leaving us to dereference a
> +			 * freed slab when we walked to next afterwards.
> +			 */
> +			next = hlist_entry_safe(file->f_list.next,
> +						struct nlm_file, f_list);
> +			if (next)
> +				next->f_count++;
> +
> +			if (!is_failover_file || is_failover_file(data, file)) {
> +				mutex_unlock(&nlm_file_mutex);
> +
> +				/*
> +				 * Traverse locks, blocks and shares of this
> +				 * file and update file->f_locks count.
> +				 */
> +				if (nlm_inspect_file(data, file, match))
> +					ret = 1;
> +
> +				mutex_lock(&nlm_file_mutex);
> +			}
>  
> -			mutex_lock(&nlm_file_mutex);
>  			file->f_count--;
> -			/* No more references to this file. Let go of it. */
> -			if (list_empty(&file->f_blocks) && !file->f_locks
> -			 && !file->f_shares && !file->f_count) {
> -				hlist_del(&file->f_list);
> -				nlm_close_files(file);
> -				kfree(file);
> -			}
> +			nlm_file_release(file);
> +			file = next;
>  		}
>  	}
>  	mutex_unlock(&nlm_file_mutex);

Sashiko seems to think there is a regression here. See:

https://sashiko.dev/#/patchset/20260523014203.2462827-1-michael.bommarito@gmail.com?part=1
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH] lockd: pin next file across nlm_inspect_file lock-drop
Posted by Michael Bommarito 1 day, 7 hours ago
On Sat, May 23, 2026 at 7:05 AM Jeff Layton <jlayton@kernel.org> wrote:
> Sashiko seems to think there is a regression here. See:
> https://sashiko.dev/#/patchset/20260523014203.2462827-1-michael.bommarito@gmail.com?part=1

Yeah, the predicate check is a real regression.  I'll fix that and send a v2.

The other one  is a separate issue (nlmsvc_create_block).  I never saw
that path fire in my harness, but will send a separate patch if I can
get that to light up.

Thanks,
Mike