Hi Christian, Steve,

Here are some miscellaneous fixes and changes for netfslib and cifs, if you
could consider pulling them.  All the bugs fixed were observed in cifs, so
they should probably go through the cifs tree unless Christian would much
prefer for them to go through the VFS tree.

Many of these were found because a bug in Samba was causing smbd to crash
and restart after about 1-2s and this was vigorously and abruptly
exercising the netfslib retry paths.  Subsequent testing of the cifs RDMA
support showed up some more bugs, but the fixes for those went via the cifs
tree and have been removed from this set as they're now upstream.

First, there are some netfs fixes:

 (1) Fix a hang due to missing case in final DIO read result collection
     not breaking out of a loop if the request finished, but there were no
     subrequests being processed and NETFS_RREQ_ALL_QUEUED wasn't yet set.

 (2) Fix a double put of the netfs_io_request struct if completion happened
     in the pause loop.

 (3) Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag
     wrangling (a sketch of the sort of thing meant follows below).

 (4) Fix infinite looping in netfs_wait_for_pause/request() which was
     caused by a loop waiting for NETFS_RREQ_ALL_QUEUED to get set - but
     which wouldn't get set until the looping function returned.  This uses
     patch (3) above.

 (5) Fix a ref leak on an extra subrequest inserted into a request's list
     of subreqs because more subreq records were needed for retrying than
     were needed for the original request (say, for instance, that the
     amount of cifs credit available was reduced and, subsequently, the ops
     had to be smaller).

Then a bunch of cifs fixes, some of which are from other people:

 (6-8) cifs: Fix various RPC callbacks to set NETFS_SREQ_NEED_RETRY if a
     subrequest fails retriably.

 (10) Fix a warning in the workqueue code when reconnecting a channel.

Followed by some patches to deal with i_size handling:

 (11) Fix the updating of i_size to use a lock to avoid a race between
      testing if we should have extended the file with a DIO write and
      changing i_size.

 (12) A follow-up patch to (11) to merge the places in netfslib that update
      i_size on write.

And finally a couple of patches to improve tracing output, but that should
otherwise not affect functionality:

 (13) Renumber the NETFS_RREQ_* flags to make the hex values easier to
      interpret by eye, including moving the main status flags down to the
      lowest bits, with IN_PROGRESS in bit 0.

 (14) Update the tracepoints in a number of ways, including adding more
      tracepoints into the cifs read/write RPC callbacks so that different
      MID_RESPONSE_* values can be differentiated.

The last three patches (12-14) could wait for the next merge window.
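To give a flavour of (3), the helpers are along these lines (a minimal
sketch only - the names here are illustrative and the actual patch may add
further wrappers, eg. ones that also do a wakeup):

	/* Sketch: centralise the open-coded test_bit() calls on the
	 * IN_PROGRESS flags so the wait/wake logic has one place to look.
	 */
	static inline bool netfs_check_rreq_in_progress(const struct netfs_io_request *rreq)
	{
		return test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
	}

	static inline bool netfs_check_subreq_in_progress(const struct netfs_io_subrequest *subreq)
	{
		return test_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
	}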
The patches can also be found here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes

Thanks,
David

David Howells (9):
      netfs: Fix hang due to missing case in final DIO read result collection
      netfs: Fix double put of request
      netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling
      netfs: Fix looping in wait functions
      netfs: Fix ref leak on inserted extra subreq in write retry
      netfs: Fix i_size updating
      netfs: Merge i_size update functions
      netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read
      netfs: Update tracepoints in a number of ways

Paulo Alcantara (4):
      smb: client: set missing retry flag in smb2_writev_callback()
      smb: client: set missing retry flag in cifs_readv_callback()
      smb: client: set missing retry flag in cifs_writev_callback()
      smb: client: fix warning when reconnecting channel

 fs/netfs/buffered_write.c    | 38 ++++++++++++++++-----------
 fs/netfs/direct_write.c      | 16 ------------
 fs/netfs/internal.h          | 26 ++++++++++++++++++-
 fs/netfs/main.c              |  6 ++---
 fs/netfs/misc.c              | 50 ++++++++++++++++++++++--------------
 fs/netfs/read_collect.c      | 16 ++++++++----
 fs/netfs/write_collect.c     | 14 ++++++----
 fs/netfs/write_retry.c       |  3 +--
 fs/smb/client/cifsglob.h     |  1 +
 fs/smb/client/cifssmb.c      | 22 ++++++++++++++++
 fs/smb/client/smb2pdu.c      | 37 ++++++++++++++++++--------
 include/linux/netfs.h        | 21 ++++++++-------
 include/trace/events/netfs.h | 29 ++++++++++++++-------
 13 files changed, 183 insertions(+), 96 deletions(-)
David Howells <dhowells@redhat.com> wrote:

> Here are some miscellaneous fixes and changes for netfslib and cifs, if you
> could consider pulling them.  All the bugs fixed were observed in cifs, so
> they should probably go through the cifs tree unless Christian would much
> prefer for them to go through the VFS tree.

Hi David,

your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the
waitqueue used") has given me serious headaches; it has caused outages in
our web hosting clusters (yet again - all Linux versions since 6.9 had
serious netfs regressions).  Your patch was backported to 6.15 as commit
329ba1cb402a in 6.15.3 (why oh why??), and therefore the bugs it has caused
will be "available" to all Linux stable users.

The problem we had is that writing to certain files never finishes.  It
looks like it has to do with the cachefiles subrequest never reporting
completion.  (We use Ceph with cachefiles.)

I have tried applying the fixes in this pull request, which sounded
promising, but the problem is still there.  The only thing that helps is
reverting 2b1424cd131c completely - everything is fine with 6.15.5 plus the
revert.

What do you need from me in order to analyze the bug?

Max
Max Kellermann <max.kellermann@ionos.com> wrote:

> your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about
> the waitqueue used") has given me serious headaches; it has caused
> outages in our web hosting clusters (yet again - all Linux versions
> since 6.9 had serious netfs regressions). Your patch was backported to
> 6.15 as commit 329ba1cb402a in 6.15.3 (why oh why??), and therefore
> the bugs it has caused will be "available" to all Linux stable users.
>
> The problem we had is that writing to certain files never finishes. It
> looks like it has to do with the cachefiles subrequest never reporting
> completion. (We use Ceph with cachefiles)
>
> I have tried applying the fixes in this pull request, which sounded
> promising, but the problem is still there. The only thing that helps
> is reverting 2b1424cd131c completely - everything is fine with 6.15.5
> plus the revert.
>
> What do you need from me in order to analyze the bug?

As a start, can you turn on:

	echo 65536 >/sys/kernel/debug/tracing/buffer_size_kb
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_failure/enable

If you keep an eye on /proc/fs/netfs/requests you should be able to see any
tasks in there that get stuck.  If one gets stuck, then:

	echo 0 > /sys/kernel/debug/tracing/events/enable

to stop further tracing.

Looking in /proc/fs/netfs/requests, you should be able to see the debug ID
of the stuck request.  If you can try grepping the trace log for that:

	grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace

that should hopefully let me see how things progressed on that call.

David
On Wed, Jul 9, 2025 at 3:01 PM David Howells <dhowells@redhat.com> wrote:
> If you keep an eye on /proc/fs/netfs/requests you should be able to see any
> tasks in there that get stuck. If one gets stuck, then:

After one got stuck, this is what I see in /proc/fs/netfs/requests:

 REQUEST  OR REF FL       ERR  OPS COVERAGE
 ======== == === ==       ==== === =========
 00000065 2C  2  80002020    0   0 @0000 0/0

> Looking in /proc/fs/netfs/requests, you should be able to see the debug ID of
> the stuck request. If you can try grepping the trace log for that:
>
>         grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace

 kworker/u96:4-455 [008] ...1. 107.145222: netfs_sreq: R=00000065[1] WRIT PREP  f=00  s=0 0/0       s=0 e=0
 kworker/u96:4-455 [008] ...1. 107.145292: netfs_sreq: R=00000065[1] WRIT SUBMT f=100 s=0 0/29e1    s=0 e=0
 kworker/u96:4-455 [008] ...1. 107.145311: netfs_sreq: R=00000065[1] WRIT CA-PR f=100 s=0 0/3000    s=0 e=0
 kworker/u96:4-455 [008] ...1. 107.145457: netfs_sreq: R=00000065[1] WRIT CA-WR f=100 s=0 0/3000    s=0 e=0
 kworker/8:1-437   [008] ...1. 107.149530: netfs_sreq: R=00000065[1] WRIT TERM  f=100 s=0 3000/3000 s=2 e=0
 kworker/8:1-437   [008] ...1. 107.149531: netfs_rreq: R=00000065 2C WAKE-Q f=80002020

I can reproduce this easily - it happens every time I log out of that
machine when bash tries to write the bash_history file - the write() always
gets stuck.

(The above was 6.15.5 plus all patches in this PR.)
Max Kellermann <max.kellermann@ionos.com> wrote:

> kworker/8:1-437 [008] ...1. 107.149531: netfs_rreq:
> R=00000065 2C WAKE-Q f=80002020
>
> ...
> (The above was 6.15.5 plus all patches in this PR.)

Can you check that, please?  If you have patch 12 applied, then the flags
should be renumbered and there shouldn't be a NETFS_RREQ_ flag with 13, but
f=80002020 would seem to have 0x2000 (ie. bit 13) set in it.

If you do have it applied, then this might be an indicator of the issue.

David
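(To spell out the arithmetic behind that reading of the mask:

	0x80002020 = (1 << 31) | (1 << 13) | (1 << 5)

ie. bits 31, 13 and 5 are set - so a bit-13 flag is indeed present, which
shouldn't exist after the renumbering.)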
On Wed, Jul 9, 2025 at 10:22 PM David Howells <dhowells@redhat.com> wrote:
> > (The above was 6.15.5 plus all patches in this PR.)
>
> Can you check that, please? If you have patch 12 applied, then the flags
> should be renumbered and there shouldn't be a NETFS_RREQ_ flag with 13, but
> f=80002020 would seem to have 0x2000 (ie. bit 13) set in it.

Oh, I was slightly wrong: I merged only 12 patches, omitting the
renumbering patch because it had conflicts with my branch, and it was only
a cosmetic change, not relevant for the bug.  Sorry for the confusion!
Hi Max,

I managed to reproduce it on my test machine with ceph + fscache.

Does this fix the problem for you?

David
---
netfs: Fix copy-to-cache so that it performs collection with ceph+fscache

The netfs copy-to-cache that is used by Ceph with local caching sets up a
new request to write data just read to the cache.  The request is started
and then left to look after itself whilst the app continues.  The request
gets notified by the backing fs upon completion of the async DIO write, but
then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't
set - but the app isn't waiting there, and so the request just hangs.

Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the
notification from the backing filesystem to put the collection onto a work
queue instead.

Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
---
 fs/netfs/read_pgpriv2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index 5bbe906a551d..080d2a6a51d9 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -110,6 +110,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 	if (!creq->io_streams[1].avail)
 		goto cancel_put;
 
+	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
 	trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
 	netfs_stat(&netfs_n_wh_copy_to_cache);
 	rreq->copy_to_cache = creq;
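For context, the effect of the flag is in how the request's completion
notification is dispatched - roughly like this (a paraphrase of the
netfs_wake_collector() logic, not the literal function, which has
additional conditions, eg. around retrying):

	void netfs_wake_collector(struct netfs_io_request *rreq)
	{
		if (test_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &rreq->flags)) {
			/* Offloaded: run the collector from a workqueue. */
			queue_work(system_unbound_wq, &rreq->work);
		} else {
			/* Otherwise assume an app thread is waiting on the
			 * request and poke it - which is wrong for
			 * copy-to-cache, where no one is waiting.
			 */
			trace_netfs_rreq(rreq, netfs_rreq_trace_wake_queue);
			wake_up(&rreq->waitq);
		}
	}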
On Thu, Jul 10, 2025 at 12:47 PM David Howells <dhowells@redhat.com> wrote:
> I managed to reproduce it on my test machine with ceph + fscache.
>
> Does this fix the problem for you?

Yes!  I can no longer reproduce my problem.  Thanks, David.
Hi Max,

Depending on what you're doing on ceph, you might need the attached patch
as well.  I managed to reproduce it by doing a git clone and kernel build
on a ceph mount with cachefiles active.

David
----
commit 6c24e7124642846b628274c845cd6de252f1dfbf
Author: David Howells <dhowells@redhat.com>
Date:   Thu Jul 10 15:02:57 2025 +0100

    netfs: Fix race between cache write completion and ALL_QUEUED being set

    When netfslib is issuing subrequests, the subrequests start processing
    immediately and may complete before we reach the end of the issuing
    function.  At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED
    to indicate to the collector that we aren't going to issue any more subreqs
    and that it can do the final notifications and cleanup.

    Now, this isn't a problem if the request is synchronous
    (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be
    done in-thread and we're guaranteed an opportunity to run the collector.

    However, if the request is asynchronous, collection is primarily triggered
    by the termination of subrequests queuing it on a workqueue.  Now, a race
    can occur here if the app thread sets ALL_QUEUED after the last subrequest
    terminates.

    This can happen most easily with the copy2cache code (as used by Ceph)
    where, in the collection routine of a read request, an asynchronous write
    request is spawned to copy data to the cache.  Folios are added to the
    write request as they're unlocked, but there may be a delay before
    ALL_QUEUED is set as the write subrequests may complete before we get
    there.

    If all the write subreqs have finished by the ALL_QUEUED point, no further
    events happen and the collection never happens, leaving the request
    hanging.

    Fix this by queuing the collector after setting ALL_QUEUED.  This is a bit
    heavy-handed and it may be sufficient to do it only if there are no extant
    subreqs.

    Also add a tracepoint to cross-reference both requests in a copy-to-request
    operation and add a trace to the netfs_rreq tracepoint to indicate the
    setting of ALL_QUEUED.

    Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
    Reported-by: Max Kellermann <max.kellermann@ionos.com>
    Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: Paulo Alcantara <pc@manguebit.org>
    cc: Viacheslav Dubeyko <slava@dubeyko.com>
    cc: Alex Markuze <amarkuze@redhat.com>
    cc: Ilya Dryomov <idryomov@gmail.com>
    cc: netfs@lists.linux.dev
    cc: ceph-devel@vger.kernel.org
    cc: linux-fsdevel@vger.kernel.org
    cc: stable@vger.kernel.org

diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index 080d2a6a51d9..889ff7954f8c 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -111,6 +111,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 		goto cancel_put;
 
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
+	trace_netfs_copy2cache(rreq, creq);
 	trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
 	netfs_stat(&netfs_n_wh_copy_to_cache);
 	rreq->copy_to_cache = creq;
@@ -155,6 +156,8 @@ void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
 	netfs_issue_write(creq, &creq->io_streams[1]);
 	smp_wmb(); /* Write lists before ALL_QUEUED. */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &creq->flags);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_end_copy_to_cache);
+	netfs_wake_collector(creq);
 
 	netfs_put_request(creq, netfs_rreq_trace_put_return);
 	creq->copy_to_cache = NULL;
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 73e96ccbe830..64a382fbc31a 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -55,6 +55,7 @@
 	EM(netfs_rreq_trace_copy,		"COPY   ")	\
 	EM(netfs_rreq_trace_dirty,		"DIRTY  ")	\
 	EM(netfs_rreq_trace_done,		"DONE   ")	\
+	EM(netfs_rreq_trace_end_copy_to_cache,	"END-C2C")	\
 	EM(netfs_rreq_trace_free,		"FREE   ")	\
 	EM(netfs_rreq_trace_ki_complete,	"KI-CMPL")	\
 	EM(netfs_rreq_trace_recollect,		"RECLLCT")	\
@@ -559,6 +560,35 @@ TRACE_EVENT(netfs_write,
 		      __entry->start, __entry->start + __entry->len - 1)
 	    );
 
+TRACE_EVENT(netfs_copy2cache,
+	    TP_PROTO(const struct netfs_io_request *rreq,
+		     const struct netfs_io_request *creq),
+
+	    TP_ARGS(rreq, creq),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	rreq)
+		    __field(unsigned int,	creq)
+		    __field(unsigned int,	cookie)
+		    __field(unsigned int,	ino)
+			     ),
+
+	    TP_fast_assign(
+		    struct netfs_inode *__ctx = netfs_inode(rreq->inode);
+		    struct fscache_cookie *__cookie = netfs_i_cookie(__ctx);
+		    __entry->rreq	= rreq->debug_id;
+		    __entry->creq	= creq->debug_id;
+		    __entry->cookie	= __cookie ? __cookie->debug_id : 0;
+		    __entry->ino	= rreq->inode->i_ino;
+			   ),
+
+	    TP_printk("R=%08x CR=%08x c=%08x i=%x ",
+		      __entry->rreq,
+		      __entry->creq,
+		      __entry->cookie,
+		      __entry->ino)
+	    );
+
 TRACE_EVENT(netfs_collect,
 	    TP_PROTO(const struct netfs_io_request *wreq),
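To spell out the race being closed, an illustrative interleaving (not taken
from a trace):

	App thread (end_copy_to_cache)     Subreq completion (async)
	------------------------------     -------------------------
	issue last write subreq
	                                   subreq terminates;
	                                   ALL_QUEUED not yet visible,
	                                   so collector isn't queued
	set_bit(ALL_QUEUED)
	[before: return]                   (no further events)
	=> request hangs
	[after: netfs_wake_collector()]
	=> collection runs and completes

Note that if the subrequest list is non-empty at the ALL_QUEUED point, the
last subreq to terminate will see the flag (ordered by the smp_wmb()) and
queue the collector itself, so an unconditional wakeup is only strictly
needed when the list is already empty.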
David Howells <dhowells@redhat.com> wrote:

> Depending on what you're doing on ceph, you might need the attached patch as
> well.  I managed to reproduce it by doing a git clone and kernel build on a
> ceph mount with cachefiles active.

Here's a version of the patch that conditionally does the needed wakeup.  I
don't want to force processing if there's no need.

David
---
commit 1fe42a9a7f0b2f51107574f0b8e151d13dc766cc
Author: David Howells <dhowells@redhat.com>
Date:   Thu Jul 10 15:02:57 2025 +0100

    netfs: Fix race between cache write completion and ALL_QUEUED being set

    When netfslib is issuing subrequests, the subrequests start processing
    immediately and may complete before we reach the end of the issuing
    function.  At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED
    to indicate to the collector that we aren't going to issue any more subreqs
    and that it can do the final notifications and cleanup.

    Now, this isn't a problem if the request is synchronous
    (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be
    done in-thread and we're guaranteed an opportunity to run the collector.

    However, if the request is asynchronous, collection is primarily triggered
    by the termination of subrequests queuing it on a workqueue.  Now, a race
    can occur here if the app thread sets ALL_QUEUED after the last subrequest
    terminates.

    This can happen most easily with the copy2cache code (as used by Ceph)
    where, in the collection routine of a read request, an asynchronous write
    request is spawned to copy data to the cache.  Folios are added to the
    write request as they're unlocked, but there may be a delay before
    ALL_QUEUED is set as the write subrequests may complete before we get
    there.

    If all the write subreqs have finished by the ALL_QUEUED point, no further
    events happen and the collection never happens, leaving the request
    hanging.

    Fix this by queuing the collector after setting ALL_QUEUED.  This is a bit
    heavy-handed and it may be sufficient to do it only if there are no extant
    subreqs.

    Also add a tracepoint to cross-reference both requests in a copy-to-request
    operation and add a trace to the netfs_rreq tracepoint to indicate the
    setting of ALL_QUEUED.

    Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
    Reported-by: Max Kellermann <max.kellermann@ionos.com>
    Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: Paulo Alcantara <pc@manguebit.org>
    cc: Viacheslav Dubeyko <slava@dubeyko.com>
    cc: Alex Markuze <amarkuze@redhat.com>
    cc: Ilya Dryomov <idryomov@gmail.com>
    cc: netfs@lists.linux.dev
    cc: ceph-devel@vger.kernel.org
    cc: linux-fsdevel@vger.kernel.org
    cc: stable@vger.kernel.org

diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index 080d2a6a51d9..8097bc069c1d 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -111,6 +111,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 		goto cancel_put;
 
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
+	trace_netfs_copy2cache(rreq, creq);
 	trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
 	netfs_stat(&netfs_n_wh_copy_to_cache);
 	rreq->copy_to_cache = creq;
@@ -155,6 +156,9 @@ void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
 	netfs_issue_write(creq, &creq->io_streams[1]);
 	smp_wmb(); /* Write lists before ALL_QUEUED. */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &creq->flags);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_end_copy_to_cache);
+	if (list_empty_careful(&creq->io_streams[1].subrequests))
+		netfs_wake_collector(creq);
 
 	netfs_put_request(creq, netfs_rreq_trace_put_return);
 	creq->copy_to_cache = NULL;
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 73e96ccbe830..64a382fbc31a 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -55,6 +55,7 @@
 	EM(netfs_rreq_trace_copy,		"COPY   ")	\
 	EM(netfs_rreq_trace_dirty,		"DIRTY  ")	\
 	EM(netfs_rreq_trace_done,		"DONE   ")	\
+	EM(netfs_rreq_trace_end_copy_to_cache,	"END-C2C")	\
 	EM(netfs_rreq_trace_free,		"FREE   ")	\
 	EM(netfs_rreq_trace_ki_complete,	"KI-CMPL")	\
 	EM(netfs_rreq_trace_recollect,		"RECLLCT")	\
@@ -559,6 +560,35 @@ TRACE_EVENT(netfs_write,
 		      __entry->start, __entry->start + __entry->len - 1)
 	    );
 
+TRACE_EVENT(netfs_copy2cache,
+	    TP_PROTO(const struct netfs_io_request *rreq,
+		     const struct netfs_io_request *creq),
+
+	    TP_ARGS(rreq, creq),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned int,	rreq)
+		    __field(unsigned int,	creq)
+		    __field(unsigned int,	cookie)
+		    __field(unsigned int,	ino)
+			     ),
+
+	    TP_fast_assign(
+		    struct netfs_inode *__ctx = netfs_inode(rreq->inode);
+		    struct fscache_cookie *__cookie = netfs_i_cookie(__ctx);
+		    __entry->rreq	= rreq->debug_id;
+		    __entry->creq	= creq->debug_id;
+		    __entry->cookie	= __cookie ? __cookie->debug_id : 0;
+		    __entry->ino	= rreq->inode->i_ino;
+			   ),
+
+	    TP_printk("R=%08x CR=%08x c=%08x i=%x ",
+		      __entry->rreq,
+		      __entry->creq,
+		      __entry->cookie,
+		      __entry->ino)
+	    );
+
 TRACE_EVENT(netfs_collect,
 	    TP_PROTO(const struct netfs_io_request *wreq),
David Howells <dhowells@redhat.com> wrote:

> I managed to reproduce it on my test machine with ceph + fscache.
>
> Does this fix the problem for you?

There's at least one more bug in there, so this won't fix all of the cases.

David