ceph/libceph: fix hung tasks and connection recovery during network disruptions

[PATCH v1 04/13] ceph: fix race condition in cleanup_session_requests()

Posted by Ionut Nechita (Wind River) 3 weeks, 5 days ago

From: Ionut Nechita <ionut.nechita@windriver.com>

When an MDS session is closed or reset, cleanup_session_requests()
only unregisters requests that are on the session's s_unsafe list.
However, requests are only added to s_unsafe after receiving an
"unsafe" reply from the MDS.
This creates a race condition: if a write request has been sent
but the MDS becomes unavailable before sending the unsafe reply,
the request will:
  - Have r_session set (points to the failed session)
  - Be in the request_tree
  - NOT be on s_unsafe list
  - Never have r_safe_completion signaled
Meanwhile, flush_mdlog_and_wait_mdsc_unsafe_requests() iterates
the request_tree looking for write requests with r_session set,
and waits on r_safe_completion for each one. Since the request
is not on s_unsafe, cleanup_session_requests() won't unregister
it, and the completion is never signaled - causing an indefinite
hang.
This was observed in production when running xfstests generic/013
in a loop, with stack traces showing:
  INFO: task fsstress:14466 blocked for more than 122 seconds.
  Call Trace:
    wait_for_completion+0x14a/0x340
    ceph_mdsc_sync+0x4b4/0xe80
    ceph_sync_fs+0xa0/0x4c0
    sync_filesystem+0x182/0x240
Fix this by extending cleanup_session_requests() to also unregister
requests that:
  - Belong to the closing session (r_session->s_mds matches)
  - Have NOT received an unsafe reply (CEPH_MDS_R_GOT_UNSAFE not set)
  - Have NOT received a safe reply (CEPH_MDS_R_GOT_SAFE not set)
These are requests that were in-flight when the session failed and
will never complete. Unregistering them signals r_safe_completion,
unblocking any waiters.
Requests that received an unsafe reply but not yet a safe reply
are already on s_unsafe and handled by the existing code. For
these, we preserve the original behavior of resetting r_attempts
to allow re-sending when the session reconnects.
Fixes: e3ec8d689cf4 ("ceph: clean up unsafe requests when reconnecting is denied")
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 fs/ceph/mds_client.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 37899464101f7..45abddd7f317e 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1792,6 +1792,8 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
 
 	doutc(cl, "mds%d\n", session->s_mds);
 	mutex_lock(&mdsc->mutex);
+
+	/* First, handle requests on the unsafe list */
 	while (!list_empty(&session->s_unsafe)) {
 		req = list_first_entry(&session->s_unsafe,
 				       struct ceph_mds_request, r_unsafe_item);
@@ -1803,14 +1805,30 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
 			mapping_set_error(req->r_unsafe_dir->i_mapping, -EIO);
 		__unregister_request(mdsc, req);
 	}
-	/* zero r_attempts, so kick_requests() will re-send requests */
+
+	/*
+	 * Iterate through all pending requests for this session.
+	 * Requests that haven't received an unsafe reply yet will never
+	 * complete on this session - unregister them to signal waiters.
+	 * Requests that got unsafe but not safe are handled above via
+	 * s_unsafe list; for any remaining, reset r_attempts to allow
+	 * re-sending when session reconnects.
+	 */
 	p = rb_first(&mdsc->request_tree);
 	while (p) {
 		req = rb_entry(p, struct ceph_mds_request, r_node);
 		p = rb_next(p);
 		if (req->r_session &&
-		    req->r_session->s_mds == session->s_mds)
-			req->r_attempts = 0;
+		    req->r_session->s_mds == session->s_mds) {
+			if (!test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags) &&
+			    !test_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags)) {
+				doutc(cl, " dropping pending request %llu\n",
+				      req->r_tid);
+				__unregister_request(mdsc, req);
+			} else {
+				req->r_attempts = 0;
+			}
+		}
 	}
 	mutex_unlock(&mdsc->mutex);
 }
-- 
2.53.0

Re: [PATCH v1 04/13] ceph: fix race condition in cleanup_session_requests()

Posted by Viacheslav Dubeyko 3 weeks, 4 days ago

On Thu, 2026-03-12 at 10:16 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> When an MDS session is closed or reset, cleanup_session_requests()
> only unregisters requests that are on the session's s_unsafe list.
> However, requests are only added to s_unsafe after receiving an
> "unsafe" reply from the MDS.
> This creates a race condition: if a write request has been sent
> but the MDS becomes unavailable before sending the unsafe reply,
> the request will:
>   - Have r_session set (points to the failed session)
>   - Be in the request_tree
>   - NOT be on s_unsafe list
>   - Never have r_safe_completion signaled
> Meanwhile, flush_mdlog_and_wait_mdsc_unsafe_requests() iterates
> the request_tree looking for write requests with r_session set,
> and waits on r_safe_completion for each one. Since the request
> is not on s_unsafe, cleanup_session_requests() won't unregister
> it, and the completion is never signaled - causing an indefinite
> hang.
> This was observed in production when running xfstests generic/013
> in a loop, with stack traces showing:
>   INFO: task fsstress:14466 blocked for more than 122 seconds.
>   Call Trace:
>     wait_for_completion+0x14a/0x340
>     ceph_mdsc_sync+0x4b4/0xe80
>     ceph_sync_fs+0xa0/0x4c0
>     sync_filesystem+0x182/0x240
> Fix this by extending cleanup_session_requests() to also unregister
> requests that:
>   - Belong to the closing session (r_session->s_mds matches)
>   - Have NOT received an unsafe reply (CEPH_MDS_R_GOT_UNSAFE not set)
>   - Have NOT received a safe reply (CEPH_MDS_R_GOT_SAFE not set)
> These are requests that were in-flight when the session failed and
> will never complete. Unregistering them signals r_safe_completion,
> unblocking any waiters.
> Requests that received an unsafe reply but not yet a safe reply
> are already on s_unsafe and handled by the existing code. For
> these, we preserve the original behavior of resetting r_attempts
> to allow re-sending when the session reconnects.
> Fixes: e3ec8d689cf4 ("ceph: clean up unsafe requests when reconnecting is denied")
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>  fs/ceph/mds_client.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 37899464101f7..45abddd7f317e 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1792,6 +1792,8 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
>  
>  	doutc(cl, "mds%d\n", session->s_mds);
>  	mutex_lock(&mdsc->mutex);
> +
> +	/* First, handle requests on the unsafe list */
>  	while (!list_empty(&session->s_unsafe)) {
>  		req = list_first_entry(&session->s_unsafe,
>  				       struct ceph_mds_request, r_unsafe_item);
> @@ -1803,14 +1805,30 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc,
>  			mapping_set_error(req->r_unsafe_dir->i_mapping, -EIO);
>  		__unregister_request(mdsc, req);
>  	}
> -	/* zero r_attempts, so kick_requests() will re-send requests */
> +
> +	/*
> +	 * Iterate through all pending requests for this session.
> +	 * Requests that haven't received an unsafe reply yet will never
> +	 * complete on this session - unregister them to signal waiters.
> +	 * Requests that got unsafe but not safe are handled above via
> +	 * s_unsafe list; for any remaining, reset r_attempts to allow
> +	 * re-sending when session reconnects.
> +	 */
>  	p = rb_first(&mdsc->request_tree);
>  	while (p) {
>  		req = rb_entry(p, struct ceph_mds_request, r_node);
>  		p = rb_next(p);
>  		if (req->r_session &&
> -		    req->r_session->s_mds == session->s_mds)
> -			req->r_attempts = 0;
> +		    req->r_session->s_mds == session->s_mds) {
> +			if (!test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags) &&
> +			    !test_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags)) {
> +				doutc(cl, " dropping pending request %llu\n",
> +				      req->r_tid);
> +				__unregister_request(mdsc, req);
> +			} else {
> +				req->r_attempts = 0;
> +			}
> +		}
>  	}
>  	mutex_unlock(&mdsc->mutex);
>  }

Nice fix.

Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

Thanks,
Slava.