NFSD: Fix server hang when there are multiple layout conflicts

[Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Dai Ngo 3 months ago

When a layout conflict triggers a call to __break_lease, the function
nfsd4_layout_lm_break clears the fl_break_time timeout before sending
the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
its loop, waiting indefinitely for the conflicting file lease to be
released.

If the number of lease conflicts matches the number of NFSD threads (which
defaults to 8), all available NFSD threads become occupied. Consequently,
there are no threads left to handle incoming requests or callback replies,
leading to a total hang of the NFS server.

This issue is reliably reproducible by running the Git test suite on a
configuration using SCSI layout.

This patchset fixes the problem by introducing a new lm_breaker_timedout
operation in lease_manager_operations and by using a timeout for layout
lease breaks.

 Documentation/filesystems/locking.rst |  2 ++
 fs/locks.c                            | 14 +++++++++++---
 fs/nfsd/nfs4layouts.c                 | 25 +++++++++++++++++++++----
 include/linux/filelock.h              |  2 ++
 4 files changed, 36 insertions(+), 7 deletions(-)

Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Benjamin Coddington 3 months ago

On 6 Nov 2025, at 12:05, Dai Ngo wrote:

> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
>
> If the number of lease conflicts matches the number of NFSD threads (which
> defaults to 8), all available NFSD threads become occupied. Consequently,
> there are no threads left to handle incoming requests or callback replies,
> leading to a total hang of the NFS server.
>
> This issue is reliably reproducible by running the Git test suite on a
> configuration using SCSI layout.
>
> This patchset fixes this problem by introducing the new lm_breaker_timedout
> operation to lease_manager_operations and using timeout for layout
> lease break.

Hey Dai,

I like your solution here, but I worry it can cause unexpected or
unnecessary client fencing when the problem is server-side (not enough
threads).  Clients might be dutifully sending LAYOUTRETURN, but the server
can't service them - and this change will cause some potentially unexpected
fencing in environments where things could be fixed (by adding more knfsd
threads).  Also, I think we significantly bumped default thread counts
recently in nfs-utils:
eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16

You probably have already seen previous discussions about this:
https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/

This also changes the behavior for all layouts. I haven't thought through
the implications of that, but I wish we could have a knob for this behavior,
or perhaps a knfsd-specific fl_break_time tunable.

Last thought (for now): I think Neil has some work for dynamic knfsd thread
count.. or Jeff?  (I am having trouble finding it) Would that work around
this problem?

Regards,
Ben

Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Dai Ngo 2 months, 4 weeks ago

Hi Ben,

On 11/9/25 10:34 AM, Benjamin Coddington wrote:
> On 6 Nov 2025, at 12:05, Dai Ngo wrote:
>
>> When a layout conflict triggers a call to __break_lease, the function
>> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
>> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
>> its loop, waiting indefinitely for the conflicting file lease to be
>> released.
>>
>> If the number of lease conflicts matches the number of NFSD threads (which
>> defaults to 8), all available NFSD threads become occupied. Consequently,
>> there are no threads left to handle incoming requests or callback replies,
>> leading to a total hang of the NFS server.
>>
>> This issue is reliably reproducible by running the Git test suite on a
>> configuration using SCSI layout.
>>
>> This patchset fixes this problem by introducing the new lm_breaker_timedout
>> operation to lease_manager_operations and using timeout for layout
>> lease break.
> Hey Dai,
>
> I like your solution here, but I worry it can cause unexpected or
> unnecessary client fencing when the problem is server-side (not enough
> threads).  Clients might be dutifully sending LAYOUTRETURN, but the server
> can't service them

I agree. This is a server problem and we penalize the client. We need
a long-term solution for dealing with the resource shortage (server
threads) problem.

Fortunately, the client can detect reservation conflict errors and appears
to retry the I/O. Also, the client will ask for new layout and in the
process it re-registers its reservation key so I/O will continue.

>   - and this change will cause some potentially unexpected
> fencing in environments where things could be fixed (by adding more knfsd
> threads).
>    Also, I think we significantly bumped default thread counts
> recently in nfs-utils:
> eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16

This helps a bit, but there is always a chance of a load that requires
more threads than the server has.

>
> You probably have already seen previous discussions about this:
> https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/
>
> This also changes the behavior for all layouts, I haven't thought through
> the implications of that - but I wish we could have knob for this behavior,
> or perhaps a knfsd-specific fl_break_time tuneable.

There is already a knob to tune the fl_break_time:
# cat /proc/sys/fs/lease-break-time

but currently lease-break-time is in seconds, so the minimum we can set
is 1, which I think is still too long to tie up a server thread.
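
For reference, a minimal sketch of reading that knob from a program. The
helper name read_lease_break_seconds is hypothetical; only the path
/proc/sys/fs/lease-break-time comes from the discussion above. The value is
a whole number of seconds, which is the granularity limit mentioned here.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read an integer number of seconds from a sysctl-style file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long read_lease_break_seconds(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long secs = -1;
    if (fscanf(f, "%ld", &secs) != 1)
        secs = -1;   /* unparsable content is reported as an error */
    fclose(f);
    return secs;
}
```

Calling read_lease_break_seconds("/proc/sys/fs/lease-break-time") on a Linux
host returns the current setting in seconds.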

>
> Last thought (for now): I think Neil has some work for dynamic knfsd thread
> count.. or Jeff?  (I am having trouble finding it) Would that work around
> this problem?

This would help, and I prefer this route rather than reworking __break_lease
to return EAGAIN/jukebox while the server is recalling the layout.

Thank you for your feedback,
-Dai

>
> Regards,
> Ben
>

Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Jeff Layton 2 months, 4 weeks ago

On Tue, 2025-11-11 at 07:24 -0800, Dai Ngo wrote:
> Hi Ben,
> 
> On 11/9/25 10:34 AM, Benjamin Coddington wrote:
> > On 6 Nov 2025, at 12:05, Dai Ngo wrote:
> > 
> > > When a layout conflict triggers a call to __break_lease, the function
> > > nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> > > the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> > > its loop, waiting indefinitely for the conflicting file lease to be
> > > released.
> > > 
> > > If the number of lease conflicts matches the number of NFSD threads (which
> > > defaults to 8), all available NFSD threads become occupied. Consequently,
> > > there are no threads left to handle incoming requests or callback replies,
> > > leading to a total hang of the NFS server.
> > > 
> > > This issue is reliably reproducible by running the Git test suite on a
> > > configuration using SCSI layout.
> > > 
> > > This patchset fixes this problem by introducing the new lm_breaker_timedout
> > > operation to lease_manager_operations and using timeout for layout
> > > lease break.
> > Hey Dai,
> > 
> > I like your solution here, but I worry it can cause unexpected or
> > unnecessary client fencing when the problem is server-side (not enough
> > threads).  Clients might be dutifully sending LAYOUTRETURN, but the server
> > can't service them
> 
> I agreed. This is a server problem and we penalize the client. We need
> a long term solution for dealing resource shortage (server threads)
> problem.
> 
> Fortunately, the client can detect reservation conflict errors and appears
> to retry the I/O. Also, the client will ask for new layout and in the
> process it re-registers its reservation key so I/O will continue.
> 
> >   - and this change will cause some potentially unexpected
> > fencing in environments where things could be fixed (by adding more knfsd
> > threads).
> >    Also, I think we significantly bumped default thread counts
> > recently in nfs-utils:
> > eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16
> 
> This helps a bit but if there is always a chance that there is a load
> that requires more than the number of server threads.
> 
> > 
> > You probably have already seen previous discussions about this:
> > https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/
> > 
> > This also changes the behavior for all layouts, I haven't thought through
> > the implications of that - but I wish we could have knob for this behavior,
> > or perhaps a knfsd-specific fl_break_time tuneable.
> 
> There is already a knob to tune the fl_break_time:
> # cat /proc/sys/fs/lease-break-time
> 
> but currently lease-break-time is in seconds so the minimum we can set
> is 1 which I think is still too long to tight up a server thread.
> 
> > 
> > Last thought (for now): I think Neil has some work for dynamic knfsd thread
> > count.. or Jeff?  (I am having trouble finding it) Would that work around
> > this problem?
> 


It would help, up to a point, but so does increasing the static thread
count. Even if we get a dynamically sized threadpool, we'll likely still
have a hard cap on the number of threads. We're always going to be
subject to this if VFS operations are going to be blocking on
delegation breaks.

> This would help, and I prefer this route rather than rework __break_lease
> to return EAGAIN/jukebox while the server recalling the layout.
> 


One way I can see to address this properly is to allow for non-blocking
lease breaks in some fashion. Basically, have the fs return -EAGAIN at
some point (maybe after a short wait) so that LAYOUTRETURN can get
through once the client retries.

Plumbing that intent down to the actual break_layout() calls is a
problem though -- that's a lot of layers. I wonder if we need some per-
task flag that tells the layout engine "always do non-blocking lease
breaks"? That sounds pretty ugly too.

The only other way I could see to fix this is to move to an
asynchronous model of some sort. IOW, have at least some operations
(anything that could conceivably cause a layout break) done
asynchronously.

Then you could dispatch the operation and put the rqstp on some sort
of waitqueue, and then let the thread do more work instead of blocking.
When the work is done, just requeue the rqstp to send the reply.
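
A toy userspace model of that park-and-requeue flow, assuming a fixed-size
array in place of a real waitqueue. The types model_request and parked_queue
and the helpers park_request and requeue_completed are hypothetical stand-ins,
not the kernel's svc_rqst handling.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARKED 16

/* Hypothetical stand-in for a request; not the kernel's struct svc_rqst. */
struct model_request {
    int id;
    int done;   /* set once the deferred operation has completed */
};

struct parked_queue {
    struct model_request *slots[MAX_PARKED];
    size_t count;
};

/* Park a request instead of blocking the worker. Returns 0, or -1 if full. */
int park_request(struct parked_queue *q, struct model_request *rq)
{
    assert(q != NULL && rq != NULL);
    if (q->count == MAX_PARKED)
        return -1;
    q->slots[q->count++] = rq;
    return 0;
}

/*
 * Move every completed request from the parked queue into ready[] so a
 * worker can send its reply. Requests that are not done, or that do not
 * fit in ready[], stay parked. Returns the number of requests moved.
 */
size_t requeue_completed(struct parked_queue *q,
                         struct model_request **ready, size_t ready_cap)
{
    size_t moved = 0, kept = 0;
    for (size_t i = 0; i < q->count; i++) {
        struct model_request *rq = q->slots[i];
        if (rq->done && moved < ready_cap)
            ready[moved++] = rq;
        else
            q->slots[kept++] = rq;
    }
    q->count = kept;
    return moved;
}
```

A worker that hits a layout conflict would call park_request() and go back to
serving other requests; whatever completes the recall sets done, and a later
requeue_completed() pass hands the request back for its reply.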

Just thinking out loud, but maybe we could use io_uring's underlying
infrastructure for this? Basically, set up an io_uring but do it all in
kernel space in nfsd thread context?
-- 
Jeff Layton <jlayton@kernel.org>

Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Chuck Lever 2 months, 4 weeks ago

On 11/11/25 10:24 AM, Dai Ngo wrote:
>>
>> Last thought (for now): I think Neil has some work for dynamic knfsd
>> thread
>> count.. or Jeff?  (I am having trouble finding it) Would that work around
>> this problem?
> 
> This would help, and I prefer this route rather than rework __break_lease
> to return EAGAIN/jukebox while the server recalling the layout.

Jeff is looking at continuing Neil's work in this area.

Adding more threads, IMHO, is not a good long term solution for this
particular issue. There's no guarantee that the server won't get stuck
no matter how many threads are created, and practically speaking, there
are only so many threads that can be created before the server goes
belly up. Or put another way, there's no way to formally prove that the
server will always be able to make forward progress with this solution.

We want NFSD to have a generic mechanism for deferring work so that an
nfsd thread never waits more than a few dozen milliseconds for anything.
This is the tactic NFSD uses for delegation recalls, for example.

-- 
Chuck Lever

Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Dai Ngo 2 months, 4 weeks ago

On 11/11/25 7:34 AM, Chuck Lever wrote:
> On 11/11/25 10:24 AM, Dai Ngo wrote:
>>> Last thought (for now): I think Neil has some work for dynamic knfsd
>>> thread
>>> count.. or Jeff?  (I am having trouble finding it) Would that work around
>>> this problem?
>> This would help, and I prefer this route rather than rework __break_lease
>> to return EAGAIN/jukebox while the server recalling the layout.
> Jeff is looking at continuing Neil's work in this area.
>
> Adding more threads, IMHO, is not a good long term solution for this
> particular issue. There's no guarantee that the server won't get stuck
> no matter how many threads are created, and practically speaking, there
> are only so many threads that can be created before the server goes
> belly up. Or put another way, there's no way to formally prove that the
> server will always be able to make forward progress with this solution.
>
> We want NFSD to have a generic mechanism for deferring work so that an
> nfsd thread never waits more than a few dozen milliseconds for anything.
> This is the tactic NFSD uses for delegation recalls, for example.

I think we need both: (1) a dynamic number of server threads and (2) the
ability to defer work as we currently do for the delegation recall. I'd
think we need (1) first as it applies to general server operations and
not just layout recalls.

Even if we had both of these enhancements, we would still need to enforce a
timeout for __break_lease since we don't want to wait for the recall forever.

-Dai



Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Chuck Lever 2 months, 4 weeks ago

On 11/11/25 10:43 AM, Dai Ngo wrote:
> I think we need both: (1) dynamic number of server threads and (2) the
> ability to defer work as we currently do for the delegation recall.

Agreed. I wasn't trying to imply that dynamic thread count shouldn't be
done at all.


-- 
Chuck Lever

Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Christoph Hellwig 2 months, 4 weeks ago

On Tue, Nov 11, 2025 at 10:34:04AM -0500, Chuck Lever wrote:
> > This would help, and I prefer this route rather than rework __break_lease
> > to return EAGAIN/jukebox while the server recalling the layout.
> 
> Jeff is looking at continuing Neil's work in this area.
> 
> Adding more threads, IMHO, is not a good long term solution for this
> particular issue. There's no guarantee that the server won't get stuck
> no matter how many threads are created, and practically speaking, there
> are only so many threads that can be created before the server goes
> belly up. Or put another way, there's no way to formally prove that the
> server will always be able to make forward progress with this solution.

Agreed.

> We want NFSD to have a generic mechanism for deferring work so that an
> nfsd thread never waits more than a few dozen milliseconds for anything.
> This is the tactic NFSD uses for delegation recalls, for example.

Agreed.  This would also be for I/O itself, as with O_DIRECT we can
fully support direct I/O, and even with buffered I/O there is some
limited non-blocking read and write support.

Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout conflicts

Posted by Christoph Hellwig 3 months ago

On Thu, Nov 06, 2025 at 09:05:24AM -0800, Dai Ngo wrote:
> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
> 
> If the number of lease conflicts matches the number of NFSD threads (which
> defaults to 8), all available NFSD threads become occupied. Consequently,
> there are no threads left to handle incoming requests or callback replies,
> leading to a total hang of the NFS server.
> 
> This issue is reliably reproducible by running the Git test suite on a
> configuration using SCSI layout.

I guess we need to implement asynchronous breaking of leases.  Which
conceptually shouldn't be too hard.