When a layout conflict triggers a call to __break_lease, the function
nfsd4_layout_lm_break clears the fl_break_time timeout before sending
the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
its loop, waiting indefinitely for the conflicting file lease to be
released.

If the number of lease conflicts matches the number of NFSD threads (which
defaults to 8), all available NFSD threads become occupied. Consequently,
there are no threads left to handle incoming requests or callback replies,
leading to a total hang of the NFS server.

This issue is reliably reproducible by running the Git test suite on a
configuration using SCSI layout.

This patchset fixes this problem by introducing the new lm_breaker_timedout
operation to lease_manager_operations and using timeout for layout
lease break.

 Documentation/filesystems/locking.rst |  2 ++
 fs/locks.c                            | 14 +++++++++++---
 fs/nfsd/nfs4layouts.c                 | 25 +++++++++++++++++++++----
 include/linux/filelock.h              |  2 ++
 4 files changed, 36 insertions(+), 7 deletions(-)
On 6 Nov 2025, at 12:05, Dai Ngo wrote:

> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
>
> If the number of lease conflicts matches the number of NFSD threads (which
> defaults to 8), all available NFSD threads become occupied. Consequently,
> there are no threads left to handle incoming requests or callback replies,
> leading to a total hang of the NFS server.
>
> This issue is reliably reproducible by running the Git test suite on a
> configuration using SCSI layout.
>
> This patchset fixes this problem by introducing the new lm_breaker_timedout
> operation to lease_manager_operations and using timeout for layout
> lease break.

Hey Dai,

I like your solution here, but I worry it can cause unexpected or
unnecessary client fencing when the problem is server-side (not enough
threads). Clients might be dutifully sending LAYOUTRETURN, but the server
can't service them - and this change will cause some potentially unexpected
fencing in environments where things could be fixed (by adding more knfsd
threads).

Also, I think we significantly bumped default thread counts
recently in nfs-utils:
    eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16

You probably have already seen previous discussions about this:
https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/

This also changes the behavior for all layouts. I haven't thought through
the implications of that - but I wish we could have a knob for this behavior,
or perhaps a knfsd-specific fl_break_time tuneable.

Last thought (for now): I think Neil has some work for dynamic knfsd thread
count.. or Jeff? (I am having trouble finding it) Would that work around
this problem?

Regards,
Ben
Hi Ben,

On 11/9/25 10:34 AM, Benjamin Coddington wrote:
> On 6 Nov 2025, at 12:05, Dai Ngo wrote:
>
>> When a layout conflict triggers a call to __break_lease, the function
>> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
>> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
>> its loop, waiting indefinitely for the conflicting file lease to be
>> released.
>>
>> If the number of lease conflicts matches the number of NFSD threads (which
>> defaults to 8), all available NFSD threads become occupied. Consequently,
>> there are no threads left to handle incoming requests or callback replies,
>> leading to a total hang of the NFS server.
>>
>> This issue is reliably reproducible by running the Git test suite on a
>> configuration using SCSI layout.
>>
>> This patchset fixes this problem by introducing the new lm_breaker_timedout
>> operation to lease_manager_operations and using timeout for layout
>> lease break.
> Hey Dai,
>
> I like your solution here, but I worry it can cause unexpected or
> unnecessary client fencing when the problem is server-side (not enough
> threads). Clients might be dutifully sending LAYOUTRETURN, but the server
> can't service them

I agree. This is a server problem and we penalize the client. We need
a long-term solution for dealing with the resource shortage (server threads)
problem.

Fortunately, the client can detect reservation conflict errors and appears
to retry the I/O. Also, the client will ask for a new layout and in the
process it re-registers its reservation key so I/O will continue.

> - and this change will cause some potentially unexpected
> fencing in environments where things could be fixed (by adding more knfsd
> threads).
> Also, I think we significantly bumped default thread counts
> recently in nfs-utils:
> eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16

This helps a bit, but there is always a chance that there is a load
that requires more than the number of server threads.

>
> You probably have already seen previous discussions about this:
> https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/
>
> This also changes the behavior for all layouts. I haven't thought through
> the implications of that - but I wish we could have a knob for this behavior,
> or perhaps a knfsd-specific fl_break_time tuneable.

There is already a knob to tune the fl_break_time:
    # cat /proc/sys/fs/lease-break-time

but currently lease-break-time is in seconds so the minimum we can set
is 1, which I think is still too long to tie up a server thread.

>
> Last thought (for now): I think Neil has some work for dynamic knfsd thread
> count.. or Jeff? (I am having trouble finding it) Would that work around
> this problem?

This would help, and I prefer this route rather than rework __break_lease
to return EAGAIN/jukebox while the server is recalling the layout.

Thank you for your feedback,
-Dai

>
> Regards,
> Ben
>
On Tue, 2025-11-11 at 07:24 -0800, Dai Ngo wrote:
> Hi Ben,
>
> On 11/9/25 10:34 AM, Benjamin Coddington wrote:
> > On 6 Nov 2025, at 12:05, Dai Ngo wrote:
> >
> > > When a layout conflict triggers a call to __break_lease, the function
> > > nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> > > the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> > > its loop, waiting indefinitely for the conflicting file lease to be
> > > released.
> > >
> > > If the number of lease conflicts matches the number of NFSD threads (which
> > > defaults to 8), all available NFSD threads become occupied. Consequently,
> > > there are no threads left to handle incoming requests or callback replies,
> > > leading to a total hang of the NFS server.
> > >
> > > This issue is reliably reproducible by running the Git test suite on a
> > > configuration using SCSI layout.
> > >
> > > This patchset fixes this problem by introducing the new lm_breaker_timedout
> > > operation to lease_manager_operations and using timeout for layout
> > > lease break.
> > Hey Dai,
> >
> > I like your solution here, but I worry it can cause unexpected or
> > unnecessary client fencing when the problem is server-side (not enough
> > threads). Clients might be dutifully sending LAYOUTRETURN, but the server
> > can't service them
>
> I agree. This is a server problem and we penalize the client. We need
> a long-term solution for dealing with the resource shortage (server threads)
> problem.
>
> Fortunately, the client can detect reservation conflict errors and appears
> to retry the I/O. Also, the client will ask for a new layout and in the
> process it re-registers its reservation key so I/O will continue.
>
> > - and this change will cause some potentially unexpected
> > fencing in environments where things could be fixed (by adding more knfsd
> > threads).
> > Also, I think we significantly bumped default thread counts
> > recently in nfs-utils:
> > eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16
>
> This helps a bit, but there is always a chance that there is a load
> that requires more than the number of server threads.
>
> >
> > You probably have already seen previous discussions about this:
> > https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/
> >
> > This also changes the behavior for all layouts. I haven't thought through
> > the implications of that - but I wish we could have a knob for this behavior,
> > or perhaps a knfsd-specific fl_break_time tuneable.
>
> There is already a knob to tune the fl_break_time:
>     # cat /proc/sys/fs/lease-break-time
>
> but currently lease-break-time is in seconds so the minimum we can set
> is 1, which I think is still too long to tie up a server thread.
>
> >
> > Last thought (for now): I think Neil has some work for dynamic knfsd thread
> > count.. or Jeff? (I am having trouble finding it) Would that work around
> > this problem?
>

It would help, up to a point, but so does increasing the static thread
count. Even if we get a dynamically sized threadpool, we'll likely still
have a hard cap on the number of threads. We're always going to be subject
to this if VFS operations are going to be blocking on delegation breaks.

> This would help, and I prefer this route rather than rework __break_lease
> to return EAGAIN/jukebox while the server is recalling the layout.
>

One way I can see to address this properly is to allow for non-blocking
lease breaks in some fashion. Basically, have the fs return -EAGAIN (maybe
after a short wait) at some point so that maybe LAYOUTRETURN can get
through once the client retries.
Plumbing that intent down to the actual break_layout() calls is a problem
though -- that's a lot of layers. I wonder if we need some per-task flag
that tells the layout engine "always do non-blocking lease breaks"? That
sounds pretty ugly too.

The only other way I could see to fix this is to move to an asynchronous
model of some sort. IOW, have at least some operations (anything that
could conceivably cause a layout break) done asynchronously. Then you
could dispatch the operation and put the rqstp on some sort of waitqueue,
and then let the thread do more work instead of blocking. When the work is
done, just requeue the rqstp to send the reply.

Just thinking out loud, but maybe we could use io_uring's underlying
infrastructure for this? Basically, set up an io_uring but do it all in
kernel space in nfsd thread context?

-- 
Jeff Layton <jlayton@kernel.org>
On 11/11/25 10:24 AM, Dai Ngo wrote:
>>
>> Last thought (for now): I think Neil has some work for dynamic knfsd
>> thread
>> count.. or Jeff? (I am having trouble finding it) Would that work around
>> this problem?
>
> This would help, and I prefer this route rather than rework __break_lease
> to return EAGAIN/jukebox while the server is recalling the layout.

Jeff is looking at continuing Neil's work in this area.

Adding more threads, IMHO, is not a good long term solution for this
particular issue. There's no guarantee that the server won't get stuck
no matter how many threads are created, and practically speaking, there
are only so many threads that can be created before the server goes
belly up. Or put another way, there's no way to formally prove that the
server will always be able to make forward progress with this solution.

We want NFSD to have a generic mechanism for deferring work so that an
nfsd thread never waits more than a few dozen milliseconds for anything.
This is the tactic NFSD uses for delegation recalls, for example.

-- 
Chuck Lever
On 11/11/25 7:34 AM, Chuck Lever wrote:
> On 11/11/25 10:24 AM, Dai Ngo wrote:
>>> Last thought (for now): I think Neil has some work for dynamic knfsd
>>> thread
>>> count.. or Jeff? (I am having trouble finding it) Would that work around
>>> this problem?
>> This would help, and I prefer this route rather than rework __break_lease
>> to return EAGAIN/jukebox while the server is recalling the layout.
> Jeff is looking at continuing Neil's work in this area.
>
> Adding more threads, IMHO, is not a good long term solution for this
> particular issue. There's no guarantee that the server won't get stuck
> no matter how many threads are created, and practically speaking, there
> are only so many threads that can be created before the server goes
> belly up. Or put another way, there's no way to formally prove that the
> server will always be able to make forward progress with this solution.
>
> We want NFSD to have a generic mechanism for deferring work so that an
> nfsd thread never waits more than a few dozen milliseconds for anything.
> This is the tactic NFSD uses for delegation recalls, for example.

I think we need both: (1) a dynamic number of server threads and (2) the
ability to defer work as we currently do for the delegation recall.
I'd think we need (1) first as it applies to general server operations
and not just layout recalls.

Even if we had both of these enhancements, we still need to enforce
a timeout for __break_lease since we don't want to wait for the recall
forever.

-Dai

>
>
On 11/11/25 10:43 AM, Dai Ngo wrote:
> I think we need both: (1) dynamic number of server threads and (2) the
> ability to defer work as we currently do for the delegation recall.

Agreed. I wasn't trying to imply that dynamic thread count shouldn't be
done at all.

-- 
Chuck Lever
On Tue, Nov 11, 2025 at 10:34:04AM -0500, Chuck Lever wrote:
> > This would help, and I prefer this route rather than rework __break_lease
> > to return EAGAIN/jukebox while the server is recalling the layout.
>
> Jeff is looking at continuing Neil's work in this area.
>
> Adding more threads, IMHO, is not a good long term solution for this
> particular issue. There's no guarantee that the server won't get stuck
> no matter how many threads are created, and practically speaking, there
> are only so many threads that can be created before the server goes
> belly up. Or put another way, there's no way to formally prove that the
> server will always be able to make forward progress with this solution.

Agreed.

> We want NFSD to have a generic mechanism for deferring work so that an
> nfsd thread never waits more than a few dozen milliseconds for anything.
> This is the tactic NFSD uses for delegation recalls, for example.

Agreed. This would also be for I/O itself, as with O_DIRECT we can
fully support direct I/O, and even with buffered I/O there is some
limited non-blocking read and write support.
On Thu, Nov 06, 2025 at 09:05:24AM -0800, Dai Ngo wrote:
> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
>
> If the number of lease conflicts matches the number of NFSD threads (which
> defaults to 8), all available NFSD threads become occupied. Consequently,
> there are no threads left to handle incoming requests or callback replies,
> leading to a total hang of the NFS server.
>
> This issue is reliably reproducible by running the Git test suite on a
> configuration using SCSI layout.

I guess we need to implement asynchronous breaking of leases. Which
conceptually shouldn't be too hard.