nfs_writepages may loop forever with -EBADF after state recovery failure

[BUG] nfs_writepages may loop forever with -EBADF after state recovery failure
Posted by Li Lingfeng 2 weeks, 6 days ago
We have encountered an issue where the NFS client gets stuck in an
infinite loop in nfs_writepages after a server restart and state recovery
failure. This causes mount operations to hang because the superblock lock
is held by the looping writeback process.

Problem Description:
When the NFS server is restarted, the client's state manager attempts to
reclaim open files. If the server returns errors such as EROFS, EIO, or
ESTALE during reclamation, the affected file's state is marked as bad (via
nfs4_state_mark_open_context_bad). Subsequently, when the writeback work
(wb_workfn) tries to flush dirty pages for that inode, nfs_writepages
enters a loop because nfs_page_create returns -EBADF, and nfs_writepages
does not treat -EBADF as a fatal error, so it retries indefinitely.

The call chain is:
nfs4_do_reclaim
  nfs4_reclaim_open_state
   __nfs4_reclaim_open_state // get -ESTALE
    nfs4_open_reclaim // ops->recover_open
     nfs4_do_open_reclaim
      _nfs4_do_open_reclaim
       nfs4_open_recover
        nfs4_open_recover_helper // return -ESTALE
         nfs4_opendata_to_nfs4_state
          _nfs4_opendata_reclaim_to_nfs4_state
           nfs_refresh_inode
   nfs4_state_mark_recovery_failed
    nfs4_state_mark_open_context_bad
     set_bit // NFS_CONTEXT_BAD

wb_workfn
  wb_do_writeback
   wb_writeback
    writeback_sb_inodes
     __writeback_single_inode
      do_writepages
       nfs_writepages // loop here
        write_cache_pages
         nfs_writepages_callback
          nfs_do_writepage
           nfs_page_async_flush
            nfs_pageio_add_request
             nfs_pageio_add_request_mirror
              __nfs_pageio_add_request
               nfs_create_subreq
                nfs_page_create // return -EBADF

nfs_writepages retries the loop as long as the error is not fatal
according to nfs_error_is_fatal(). Since -EBADF is not considered fatal,
it keeps retrying forever. This prevents the superblock lock from being
released, causing any concurrent mount operation to hang.
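
The retry behaviour described above can be modelled in userspace. This is only a rough sketch, not kernel code: the names prefixed with "model_" and the ModelOpenContext class are hypothetical, the fatal set is an illustrative subset assumed for this example, and max_passes is a cap that exists only in the model (the kernel loop has no such limit).

```python
import errno

# Assumption: an illustrative subset of the errors nfs_error_is_fatal()
# treats as fatal. The authoritative list is in the kernel sources.
# -EBADF is deliberately absent, as described in the report.
MODEL_FATAL_ERRNOS = frozenset(
    {errno.EIO, errno.ENOSPC, errno.EROFS, errno.ESTALE}
)


def model_error_is_fatal(err):
    """Model of nfs_error_is_fatal(): err is a negative errno value."""
    return -err in MODEL_FATAL_ERRNOS


class ModelOpenContext:
    """Stands in for an open context; 'bad' models the NFS_CONTEXT_BAD bit."""

    def __init__(self):
        self.bad = False


def model_page_create(ctx):
    """Model of nfs_page_create(): -EBADF once the context is marked bad."""
    return -errno.EBADF if ctx.bad else 0


def model_writepages(ctx, max_passes):
    """Model of the do/while retry loop in nfs_writepages().

    Returns (passes, err). max_passes only keeps the model from spinning
    forever; it has no counterpart in the kernel.
    """
    passes = 0
    while True:
        passes += 1
        err = model_page_create(ctx)
        # Same condition as the kernel loop: retry on any non-fatal error.
        retry = err < 0 and not model_error_is_fatal(err)
        if not retry or passes >= max_passes:
            return passes, err


ctx = ModelOpenContext()
ctx.bad = True  # state recovery failed, context marked bad
passes, err = model_writepages(ctx, max_passes=1000)
print(passes, errno.errorcode[-err])
```

With a bad context the model only stops because it hits the artificial cap; with a good context it returns after a single pass.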

Steps to Reproduce:
We have a reliable reproducer on a recent kernel (Linux 7.0-rc4, commit
2d1373e4246da3b58e1df058374ed6b101804e07).

1) Prepare a server with an export:
mkfs.ext4 -F /dev/sdb
mount /dev/sdb /mnt/sdb
echo "/mnt *(rw,no_root_squash,fsid=0)" > /etc/exports
echo "/mnt/sdb *(rw,no_root_squash,fsid=1)" >> /etc/exports
systemctl restart nfs-server
dd if=/dev/random of=/mnt/sdb/testfile bs=1k count=4 oflag=direct

2) On the client, mount the export and start a writer that holds a file
open and creates dirty pages:
mount -t nfs -o rw,vers=4.1,rsize=1024,wsize=1024 127.0.0.1:/sdb /mnt/sdbb

Run the following Python script in one terminal:
import os, time
fd = os.open("/mnt/sdbb/testfile", os.O_CREAT|os.O_WRONLY|os.O_TRUNC, 0o644)
buf = b'A' * 4096
for i in range(1024):  # 1024 * 4 KiB = 4 MiB in total
    os.write(fd, buf)
print("dirty pages created, fd kept open, sleeping...")
time.sleep(10**9)

3) In another terminal, restart the server and wipe the underlying
filesystem to force ESTALE:
systemctl stop nfs-server
umount /dev/sdb
mkfs.ext4 -F /dev/sdb
mount /dev/sdb /mnt/sdb
echo "/mnt *(rw,no_root_squash,fsid=0)" > /etc/exports
echo "/mnt/sdb *(rw,no_root_squash,fsid=1)" >> /etc/exports
systemctl restart nfs-server
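
Once the writeback loop is spinning, the symptom is that a concurrent mount does not return. A minimal sketch of a probe for that follows, assuming a plain timeout is good enough to tell "hung" from "slow". command_hangs is a hypothetical helper, not part of any NFS tooling. It deliberately does not kill or reap the child on timeout, since a task blocked in the kernel may not respond to signals and waiting for it could block the probe as well.

```python
import subprocess


def command_hangs(argv, timeout_s=30.0):
    """Return True if argv has not exited after timeout_s seconds.

    The child is left running on timeout: it is neither killed nor
    waited for, so the probe itself cannot get stuck behind it.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return True
    return False
```

It would be called with the argv of whatever mount command is being tested, for example a second mount of the same export on a different mount point; the exact command and a suitable timeout depend on the setup.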

Temporary Workaround:
We have applied the following patch to break the loop by treating -EBADF
as fatal in nfs_writepages:
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index dc57e67cefcd..0147f7a7a1a3 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -781,7 +781,7 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
                                         &pgio);
                 pgio.pg_error = 0;
                 nfs_pageio_complete(&pgio);
-       } while (err < 0 && !nfs_error_is_fatal(err));
+       } while (err < 0 && !nfs_error_is_fatal(err) && (err != -EBADF));
         nfs_io_completion_put(ioc);

         if (err < 0)

While the patch above avoids the hang, we wonder if a more comprehensive
fix is needed. For instance, perhaps nfs_error_is_fatal() should include
-EBADF in its fatal list, or the state manager should actively abort
pending I/O for contexts marked bad. We are not sure whether -EBADF should
always be considered fatal in writeback paths.
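
To make the difference between the two code-level options concrete, here is a small userspace sketch of the loop condition under each of them. It is only a model: the fatal set is an illustrative subset assumed for the example, and all function names are hypothetical.

```python
import errno

# Assumption: illustrative subset of the kernel's fatal error list.
BASE_FATAL = frozenset({errno.EIO, errno.ENOSPC, errno.EROFS, errno.ESTALE})
# Option B below: the same set with EBADF added.
WIDER_FATAL = BASE_FATAL | {errno.EBADF}


def is_fatal(err, fatal=BASE_FATAL):
    """Model of nfs_error_is_fatal() for a negative errno value."""
    return -err in fatal


def retry_stock(err):
    """Loop condition as it is today."""
    return err < 0 and not is_fatal(err)


def retry_workaround(err):
    """Option A: the extra -EBADF check, local to nfs_writepages."""
    return err < 0 and not is_fatal(err) and err != -errno.EBADF


def retry_wider_fatal(err):
    """Option B: -EBADF added to the fatal list itself."""
    return err < 0 and not is_fatal(err, WIDER_FATAL)


for name, fn in (
    ("stock", retry_stock),
    ("workaround", retry_workaround),
    ("wider fatal list", retry_wider_fatal),
):
    print(name, "retries on -EBADF:", fn(-errno.EBADF))
```

Both options stop the retry in this model. The difference is scope: option A touches only this one loop, while option B changes the answer of the fatal check for every caller, which is exactly the part we are unsure about.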

We would appreciate your insights and any suggestions for a proper fix.

Thanks,
Lingfeng