From nobody Sat Jun 13 13:34:47 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v4 01/11] ceph: convert inode flags to named bit positions and atomic bitops
Date: Thu, 7 May 2026 12:27:27 +0000
Message-Id: <20260507122737.2804094-2-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>

Define named bit-position constants for all CEPH_I_* inode flags and
derive the bitmask values from them. This gives every flag a named _BIT
constant usable with the test_bit/set_bit/clear_bit family. The
intentionally unused bit position 1 is documented inline.

Convert all flag modifications to use atomic bitops (set_bit, clear_bit,
test_and_clear_bit). The previous code mixed lockless atomic ops on some
flags (ERROR_WRITE, ODIRECT) with non-atomic read-modify-write
(|= / &= ~) on other flags sharing the same unsigned long. A concurrent
non-atomic RMW can clobber an adjacent lockless atomic update -- for
example, a lockless clear_bit(ERROR_WRITE) could be silently resurrected
by a concurrent ci->i_ceph_flags |= CEPH_I_FLUSH under the spinlock.
Using atomic bitops for all modifications eliminates this class of race
entirely.

Flags whose only users are now the _BIT form (ERROR_WRITE,
ASYNC_CHECK_CAPS) have their old mask defines removed to document that
callers must use the _BIT constant with the set_bit/test_bit family.
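
To make the race described above concrete, here is a minimal userspace
sketch. It is not kernel code: it replays the problematic interleaving
deterministically in a single thread and uses C11 <stdatomic.h> as a
stand-in for the kernel's set_bit()/clear_bit(). The DEMO_* macros and
demo_* functions are hypothetical names introduced only for this
illustration; bit positions 2 and 9 mirror FLUSH and ERROR_WRITE.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for CEPH_I_FLUSH_BIT and CEPH_I_ERROR_WRITE_BIT. */
#define DEMO_FLUSH_BIT		2
#define DEMO_ERROR_WRITE_BIT	9
#define DEMO_MASK(bit)		(1UL << (bit))

/*
 * Replay of the mixed scheme: thread A performs a plain `flags |= FLUSH`
 * (a separate load and store) while thread B atomically clears
 * ERROR_WRITE in between the two halves.
 */
unsigned long demo_mixed_rmw(unsigned long initial)
{
	atomic_ulong flags;
	unsigned long snapshot;

	atomic_init(&flags, initial);

	/* Thread A, first half of `|=`: read the whole word. */
	snapshot = atomic_load(&flags);
	/* Thread B: lockless atomic clear of ERROR_WRITE. */
	atomic_fetch_and(&flags, ~DEMO_MASK(DEMO_ERROR_WRITE_BIT));
	/* Thread A, second half of `|=`: write back the stale snapshot. */
	atomic_store(&flags, snapshot | DEMO_MASK(DEMO_FLUSH_BIT));

	return atomic_load(&flags);
}

/*
 * Same two updates, but both are single atomic read-modify-write
 * operations, so neither can overwrite the other's bit.
 */
unsigned long demo_all_atomic(unsigned long initial)
{
	atomic_ulong flags;

	atomic_init(&flags, initial);
	atomic_fetch_and(&flags, ~DEMO_MASK(DEMO_ERROR_WRITE_BIT));
	atomic_fetch_or(&flags, DEMO_MASK(DEMO_FLUSH_BIT));

	return atomic_load(&flags);
}

/* Usage: start with only ERROR_WRITE set and compare the outcomes. */
void demo_check(void)
{
	unsigned long start = DEMO_MASK(DEMO_ERROR_WRITE_BIT);

	/* Mixed scheme: the cleared ERROR_WRITE bit comes back. */
	assert(demo_mixed_rmw(start) ==
	       (DEMO_MASK(DEMO_FLUSH_BIT) | DEMO_MASK(DEMO_ERROR_WRITE_BIT)));
	/* All-atomic scheme: only FLUSH remains set. */
	assert(demo_all_atomic(start) == DEMO_MASK(DEMO_FLUSH_BIT));
}
```

In the all-atomic variant the order of the two operations does not
matter, since each one touches only its own bit.
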
ERROR_FILELOCK and SHUTDOWN retain their mask defines because they are
still used via bitmask tests in lockless readers
(ceph_inode_is_shutdown, reconnect_caps_cb).

The direct assignment in ceph_finish_async_create() is converted from
i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This inode is I_NEW at
this point -- still invisible to other threads and guaranteed to have
zero flags from alloc_inode -- so either form is safe, but set_bit()
keeps the conversion uniform.

Co-developed-by: Viacheslav Dubeyko
Signed-off-by: Viacheslav Dubeyko
Signed-off-by: Alex Markuze
Reviewed-by: Viacheslav Dubeyko
Tested-by: Viacheslav Dubeyko
---
 fs/ceph/addr.c       | 20 +++++++-------
 fs/ceph/caps.c       | 24 ++++++++---------
 fs/ceph/file.c       | 13 ++++-----
 fs/ceph/inode.c      |  4 +--
 fs/ceph/locks.c      | 22 ++++-----------
 fs/ceph/mds_client.c |  3 ++-
 fs/ceph/mds_client.h |  2 +-
 fs/ceph/snap.c       |  2 +-
 fs/ceph/super.h      | 64 +++++++++++++++++++++++---------------------
 fs/ceph/xattr.c      |  2 +-
 10 files changed, 74 insertions(+), 82 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 94ffa127b1d3..1859a0c92d66 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -2563,7 +2563,8 @@ int ceph_pool_perm_check(struct inode *inode, int need)
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_string *pool_ns;
 	s64 pool;
-	int ret, flags;
+	int ret;
+	unsigned long flags;
 
 	/* Only need to do this for regular files */
 	if (!S_ISREG(inode->i_mode))
@@ -2605,20 +2606,19 @@ int ceph_pool_perm_check(struct inode *inode, int need)
 	if (ret < 0)
 		return ret;
 
-	flags = CEPH_I_POOL_PERM;
-	if (ret & POOL_READ)
-		flags |= CEPH_I_POOL_RD;
-	if (ret & POOL_WRITE)
-		flags |= CEPH_I_POOL_WR;
-
 	spin_lock(&ci->i_ceph_lock);
 	if (pool == ci->i_layout.pool_id &&
 	    pool_ns == rcu_dereference_raw(ci->i_layout.pool_ns)) {
-		ci->i_ceph_flags |= flags;
-	} else {
+		set_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_READ)
+			set_bit(CEPH_I_POOL_RD_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_WRITE)
+			set_bit(CEPH_I_POOL_WR_BIT, &ci->i_ceph_flags);
+	} else {
 		pool = ci->i_layout.pool_id;
-		flags = ci->i_ceph_flags;
 	}
+	/* Re-read flags under the lock so check: sees the updated bits. */
+	flags = ci->i_ceph_flags;
 	spin_unlock(&ci->i_ceph_lock);
 	goto check;
 }
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..cb9e78b713d9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -549,7 +549,7 @@ static void __cap_delay_requeue_front(struct ceph_mds_client *mdsc,
 
 	doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
 	spin_lock(&mdsc->cap_delay_lock);
-	ci->i_ceph_flags |= CEPH_I_FLUSH;
+	set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 	if (!list_empty(&ci->i_cap_delay_list))
 		list_del_init(&ci->i_cap_delay_list);
 	list_add(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
@@ -1409,7 +1409,7 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
 	      ceph_cap_string(revoking));
 	BUG_ON((retain & CEPH_CAP_PIN) == 0);
 
-	ci->i_ceph_flags &= ~CEPH_I_FLUSH;
+	clear_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 
 	cap->issued &= retain;  /* drop bits we don't want */
 	/*
@@ -1666,7 +1666,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
 		last_tid = capsnap->cap_flush.tid;
 	}
 
-	ci->i_ceph_flags &= ~CEPH_I_FLUSH_SNAPS;
+	clear_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 
 	while (first_tid <= last_tid) {
 		struct ceph_cap *cap = ci->i_auth_cap;
@@ -2026,7 +2026,7 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags)
 
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		ci->i_ceph_flags |= CEPH_I_ASYNC_CHECK_CAPS;
+		set_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT, &ci->i_ceph_flags);
 
 		/* Don't send messages until we get async create reply */
 		spin_unlock(&ci->i_ceph_lock);
@@ -2577,7 +2577,7 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
 		return;
 
-	ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH;
+	clear_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
 
 	list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) {
 		if (cf->is_capsnap) {
@@ -2686,7 +2686,7 @@ void ceph_early_kick_flushing_caps(struct ceph_mds_client *mdsc,
 			__kick_flushing_caps(mdsc, session, ci,
 					     oldest_flush_tid);
 		} else {
-			ci->i_ceph_flags |= CEPH_I_KICK_FLUSH;
+			set_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
 		}
 
 		spin_unlock(&ci->i_ceph_lock);
@@ -2829,7 +2829,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 	spin_lock(&ci->i_ceph_lock);
 
 	if ((flags & CHECK_FILELOCK) &&
-	    (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK)) {
+	    test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		doutc(cl, "%p %llx.%llx error filelock\n", inode,
 		      ceph_vinop(inode));
 		ret = -EIO;
@@ -3207,7 +3207,7 @@ static int ceph_try_drop_cap_snap(struct ceph_inode_info *ci,
 		BUG_ON(capsnap->cap_flush.tid > 0);
 		ceph_put_snap_context(capsnap->context);
 		if (!list_is_last(&capsnap->ci_item, &ci->i_cap_snaps))
-			ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+			set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 
 		list_del(&capsnap->ci_item);
 		ceph_put_cap_snap(capsnap);
@@ -3396,7 +3396,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
 			if (ceph_try_drop_cap_snap(ci, capsnap)) {
 				put++;
 			} else {
-				ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+				set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 				flush_snaps = true;
 			}
 		}
@@ -3648,7 +3648,7 @@ static void handle_cap_grant(struct inode *inode,
 
 		if (ci->i_layout.pool_id != old_pool ||
 		    extra_info->pool_ns != old_ns)
-			ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
 
 		extra_info->pool_ns = old_ns;
 
@@ -4815,7 +4815,7 @@ int ceph_drop_caps_for_unlink(struct inode *inode)
 			doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode,
 			      ceph_vinop(inode));
 			spin_lock(&mdsc->cap_delay_lock);
-			ci->i_ceph_flags |= CEPH_I_FLUSH;
+			set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 			if (!list_empty(&ci->i_cap_delay_list))
 				list_del_init(&ci->i_cap_delay_list);
 			list_add_tail(&ci->i_cap_delay_list,
@@ -5080,7 +5080,7 @@ int ceph_purge_inode_cap(struct inode *inode, struct ceph_cap *cap, bool *invali
 
 	if (atomic_read(&ci->i_filelock_ref) > 0) {
 		/* make further file lock syscall return -EIO */
-		ci->i_ceph_flags |= CEPH_I_ERROR_FILELOCK;
+		set_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 		pr_warn_ratelimited_client(cl,
 			" dropping file locks for %p %llx.%llx\n",
 			inode, ceph_vinop(inode));
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d54d71669176..7ca9f60fb0e5 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -598,12 +598,12 @@ static void wake_async_create_waiters(struct inode *inode,
 
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		clear_and_wake_up_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
+		/* Serialized by i_ceph_lock; the two ops touch different bits. */
+		clear_and_wake_up_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
 
-		if (ci->i_ceph_flags & CEPH_I_ASYNC_CHECK_CAPS) {
-			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CHECK_CAPS;
+		if (test_and_clear_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT,
+				       &ci->i_ceph_flags))
 			check_cap = true;
-		}
 	}
 	ceph_kick_flushing_inode_caps(session, ci);
 	spin_unlock(&ci->i_ceph_lock);
@@ -766,7 +766,8 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
 			 * that point and don't worry about setting
 			 * CEPH_I_ASYNC_CREATE.
 			 */
-			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+			set_bit(CEPH_I_ASYNC_CREATE_BIT,
+				&ceph_inode(inode)->i_ceph_flags);
 			unlock_new_inode(inode);
 		}
 		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
@@ -2482,7 +2483,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
 	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
-	    (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+	    test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
 		struct ceph_snap_context *snapc;
 		struct iov_iter data;
 
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 22c7da1ea61c..4871d7ab2730 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1180,7 +1180,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		rcu_assign_pointer(ci->i_layout.pool_ns, pool_ns);
 
 		if (ci->i_layout.pool_id != old_pool || pool_ns != old_ns)
-			ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
 
 		pool_ns = old_ns;
 
@@ -3240,7 +3240,7 @@ void ceph_inode_shutdown(struct inode *inode)
 	bool invalidate = false;
 
 	spin_lock(&ci->i_ceph_lock);
-	ci->i_ceph_flags |= CEPH_I_SHUTDOWN;
+	set_bit(CEPH_I_SHUTDOWN_BIT, &ci->i_ceph_flags);
 	p = rb_first(&ci->i_caps);
 	while (p) {
 		struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index dd764f9c64b9..c4ff2266bb94 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -57,9 +57,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
 	ci = ceph_inode(inode);
 	if (atomic_dec_and_test(&ci->i_filelock_ref)) {
 		/* clear error when all locks are released */
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
-		spin_unlock(&ci->i_ceph_lock);
+		clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 	}
 	fl->fl_u.ceph.inode = NULL;
 	iput(inode);
@@ -271,15 +269,10 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
 	else if (IS_SETLKW(cmd))
 		wait = 1;
 
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err = -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
 			posix_lock_file(file, fl, NULL);
-		return err;
+		return -EIO;
 	}
 
 	if (lock_is_read(fl))
@@ -331,15 +324,10 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
 
 	doutc(cl, "fl_file: %p\n", fl->c.flc_file);
 
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err = -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (lock_is_unlock(fl))
 			locks_lock_file_wait(file, fl);
-		return err;
+		return -EIO;
 	}
 
 	if (IS_SETLKW(cmd))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ed17e0023705..53f1012a9e7d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3657,7 +3657,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
 
 		spin_lock(&ci->i_ceph_lock);
 		cap = ci->i_auth_cap;
-		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
+		if (test_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
+		    mds != cap->mds) {
 			doutc(cl, "session changed for auth cap %d -> %d\n",
 			      cap->session->s_mds, session->s_mds);
 
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 4e6c87f8414c..d873e784b025 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -670,7 +670,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 
-	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_I_ASYNC_CREATE_BIT,
 			   TASK_KILLABLE);
 }
 
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 52b4c2684f92..9b79a5eaca93 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -700,7 +700,7 @@ int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
 		return 0;
 	}
 
-	ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+	set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 	doutc(cl, "%p %llx.%llx cap_snap %p snapc %p %llu %s s=%llu\n",
 	      inode, ceph_vinop(inode), capsnap, capsnap->context,
 	      capsnap->context->seq, ceph_cap_string(capsnap->dirty),
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index afc89ce91804..cb45a59dbb19 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -665,23 +665,34 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 /*
  * Ceph inode.
  */
-#define CEPH_I_DIR_ORDERED	(1 << 0)  /* dentries in dir are ordered */
-#define CEPH_I_FLUSH		(1 << 2)  /* do not delay flush of dirty metadata */
-#define CEPH_I_POOL_PERM	(1 << 3)  /* pool rd/wr bits are valid */
-#define CEPH_I_POOL_RD		(1 << 4)  /* can read from pool */
-#define CEPH_I_POOL_WR		(1 << 5)  /* can write to pool */
-#define CEPH_I_SEC_INITED	(1 << 6)  /* security initialized */
-#define CEPH_I_KICK_FLUSH	(1 << 7)  /* kick flushing caps */
-#define CEPH_I_FLUSH_SNAPS	(1 << 8)  /* need flush snapss */
-#define CEPH_I_ERROR_WRITE	(1 << 9)  /* have seen write errors */
-#define CEPH_I_ERROR_FILELOCK	(1 << 10) /* have seen file lock errors */
-#define CEPH_I_ODIRECT_BIT	(11) /* inode in direct I/O mode */
-#define CEPH_I_ODIRECT		(1 << CEPH_I_ODIRECT_BIT)
-#define CEPH_ASYNC_CREATE_BIT	(12) /* async create in flight for this */
-#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
-#define CEPH_I_SHUTDOWN		(1 << 13) /* inode is no longer usable */
-#define CEPH_I_ASYNC_CHECK_CAPS	(1 << 14) /* check caps immediately after async
-					     creating finishes */
+#define CEPH_I_DIR_ORDERED_BIT		(0)  /* dentries in dir are ordered */
+					     /* bit 1 historically unused */
+#define CEPH_I_FLUSH_BIT		(2)  /* do not delay flush of dirty metadata */
+#define CEPH_I_POOL_PERM_BIT		(3)  /* pool rd/wr bits are valid */
+#define CEPH_I_POOL_RD_BIT		(4)  /* can read from pool */
+#define CEPH_I_POOL_WR_BIT		(5)  /* can write to pool */
+#define CEPH_I_SEC_INITED_BIT		(6)  /* security initialized */
+#define CEPH_I_KICK_FLUSH_BIT		(7)  /* kick flushing caps */
+#define CEPH_I_FLUSH_SNAPS_BIT		(8)  /* need flush snaps */
+#define CEPH_I_ERROR_WRITE_BIT		(9)  /* have seen write errors */
+#define CEPH_I_ERROR_FILELOCK_BIT	(10) /* have seen file lock errors */
+#define CEPH_I_ODIRECT_BIT		(11) /* inode in direct I/O mode */
+#define CEPH_I_ASYNC_CREATE_BIT		(12) /* async create in flight for this */
+#define CEPH_I_SHUTDOWN_BIT		(13) /* inode is no longer usable */
+#define CEPH_I_ASYNC_CHECK_CAPS_BIT	(14) /* check caps after async creating finishes */
+
+#define CEPH_I_DIR_ORDERED		(1 << CEPH_I_DIR_ORDERED_BIT)
+#define CEPH_I_FLUSH			(1 << CEPH_I_FLUSH_BIT)
+#define CEPH_I_POOL_PERM		(1 << CEPH_I_POOL_PERM_BIT)
+#define CEPH_I_POOL_RD			(1 << CEPH_I_POOL_RD_BIT)
+#define CEPH_I_POOL_WR			(1 << CEPH_I_POOL_WR_BIT)
+#define CEPH_I_SEC_INITED		(1 << CEPH_I_SEC_INITED_BIT)
+#define CEPH_I_KICK_FLUSH		(1 << CEPH_I_KICK_FLUSH_BIT)
+#define CEPH_I_FLUSH_SNAPS		(1 << CEPH_I_FLUSH_SNAPS_BIT)
+#define CEPH_I_ERROR_FILELOCK		(1 << CEPH_I_ERROR_FILELOCK_BIT)
+#define CEPH_I_ODIRECT			(1 << CEPH_I_ODIRECT_BIT)
+#define CEPH_I_ASYNC_CREATE		(1 << CEPH_I_ASYNC_CREATE_BIT)
+#define CEPH_I_SHUTDOWN			(1 << CEPH_I_SHUTDOWN_BIT)
 
 /*
  * Masks of ceph inode work.
@@ -694,27 +705,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 
 /*
  * We set the ERROR_WRITE bit when we start seeing write errors on an inode
- * and then clear it when they start succeeding. Note that we do a lockless
- * check first, and only take the lock if it looks like it needs to be changed.
- * The write submission code just takes this as a hint, so we're not too
- * worried if a few slip through in either direction.
+ * and then clear it when they start succeeding. The write submission code
+ * just takes this as a hint, so we're not too worried if a few slip through
+ * in either direction.
  */
 static inline void ceph_set_error_write(struct ceph_inode_info *ci)
 {
-	if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
 
 static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
 {
-	if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
 
 static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index e773be07f767..860fc8e1867d 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1054,7 +1054,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value,
 	if (current->journal_info &&
 	    !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) &&
 	    security_ismaclabel(name + XATTR_SECURITY_PREFIX_LEN))
-		ci->i_ceph_flags |= CEPH_I_SEC_INITED;
+		set_bit(CEPH_I_SEC_INITED_BIT, &ci->i_ceph_flags);
 out:
 	spin_unlock(&ci->i_ceph_lock);
 	return err;
-- 
2.34.1

From nobody Sat Jun 13 13:34:47 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze, Viacheslav Dubeyko
Subject: [PATCH v4 02/11] ceph: use proper endian conversion for flock_len in reconnect
Date: Thu, 7 May 2026 12:27:28 +0000
Message-Id: <20260507122737.2804094-3-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>

Replace the __force __le32 cast with cpu_to_le32() for the flock_len
field in reconnect_caps_cb(). The old code used a type-system bypass to
silence sparse; the new form uses the proper endian conversion macro.
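
As a hedged illustration of how the cast and the conversion above differ:
on a little-endian host both yield the same bytes, but on a big-endian
host a bare cast leaves the value in host byte order. The userspace
sketch below models both cases by writing the bytes explicitly, so it
behaves the same on any machine. The demo_* helpers are hypothetical
names for this example and are not the kernel's cpu_to_le32()
implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Model of cpu_to_le32(): least-significant byte first on any host. */
void demo_store_le32(uint32_t v, uint8_t out[4])
{
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}

/*
 * Model of a bare (__force __le32) cast on a big-endian host: the
 * bytes stay in host order, most-significant byte first.
 */
void demo_store_cast_on_be(uint32_t v, uint8_t out[4])
{
	out[0] = (uint8_t)((v >> 24) & 0xff);
	out[1] = (uint8_t)((v >> 16) & 0xff);
	out[2] = (uint8_t)((v >> 8) & 0xff);
	out[3] = (uint8_t)(v & 0xff);
}

/* How a peer decodes a little-endian 32-bit wire field. */
uint32_t demo_load_le32(const uint8_t in[4])
{
	return (uint32_t)in[0] |
	       ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) |
	       ((uint32_t)in[3] << 24);
}

/* Usage: encode the two possible flock_len values, 0 and 1. */
void demo_check_flock_len(void)
{
	uint8_t wire[4];

	/* Proper conversion: the peer reads back 1. */
	demo_store_le32(1, wire);
	assert(demo_load_le32(wire) == 1);

	/* Bare cast on big-endian: the peer reads 0x01000000. */
	demo_store_cast_on_be(1, wire);
	assert(demo_load_le32(wire) == 0x01000000u);

	/* Zero has the same bytes in either order. */
	demo_store_cast_on_be(0, wire);
	assert(demo_load_le32(wire) == 0);
}
```

Only the value 1 is affected, because zero is identical in both byte
orders.
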
Also switch from a raw bitmask test against i_ceph_flags to test_bit() on the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor for the unsigned long flags field after the bit-position conversion. Remove the now-unused CEPH_I_ERROR_FILELOCK mask define since all callers use the _BIT form with test_bit/set_bit/clear_bit. Reviewed-by: Viacheslav Dubeyko Signed-off-by: Alex Markuze Tested-by: Viacheslav Dubeyko --- fs/ceph/mds_client.c | 5 +++-- fs/ceph/super.h | 1 - 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 53f1012a9e7d..d9543399b129 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -4747,8 +4747,9 @@ static int reconnect_caps_cb(struct inode *inode, int= mds, void *arg) rec.v2.issued =3D cpu_to_le32(cap->issued); rec.v2.snaprealm =3D cpu_to_le64(ci->i_snap_realm->ino); rec.v2.pathbase =3D cpu_to_le64(path_info.vino.ino); - rec.v2.flock_len =3D (__force __le32) - ((ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) ? 0 : 1); + rec.v2.flock_len =3D cpu_to_le32( + test_bit(CEPH_I_ERROR_FILELOCK_BIT, + &ci->i_ceph_flags) ? 
0 : 1); } else { struct timespec64 ts; =20 diff --git a/fs/ceph/super.h b/fs/ceph/super.h index cb45a59dbb19..8afc6f3a10da 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -689,7 +689,6 @@ static inline struct inode *ceph_find_inode(struct supe= r_block *sb, #define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT) #define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT) #define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT) -#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT) #define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT) #define CEPH_I_ASYNC_CREATE (1 << CEPH_I_ASYNC_CREATE_BIT) #define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT) --=20 2.34.1 From nobody Sat Jun 13 13:34:47 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 76F41391E78 for ; Thu, 7 May 2026 12:27:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778156879; cv=none; b=AOzIU8L9e1/PN1TM35ZCawszDTB3IGKWxbpBErEZcXPEVMpvBgUI++QPnY0HhEMSLbgVAJNEjKzEKaq5G7FphvAsP462MaT2V728xGwyNF/dLFDl9WXRW/rysupBW+uqh2ClzcpQoNr+vFHqcOV8z4heYJ1+mCdIcqUqHyEbhfU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778156879; c=relaxed/simple; bh=IYn9i/5O3beFSM+OMajTTszwcrOZ3SjymuLCkQLppFI=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=R+X7vjCRQGhqmUI13NOCl8tixLop0tUMQWdNVA5akLwTj2M9mKzonvjqQ3ZsaI7O1QVhN5vCAr1rFf6XGUsBSks05x4KA17r35ML7jzZJxxlKXZ/GVpeo2IPUlTHFDmLP8DIiJBaN2bp7R25lk4246pmg7W9bkZEC8dONTEPlc8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com 
From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v4 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Date: Thu, 7 May 2026 12:27:29 +0000 Message-Id: <20260507122737.2804094-4-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com> References: <20260507122737.2804094-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Change send_mds_reconnect() to return an error code so callers can detect and report reconnect failures instead of silently ignoring them. Add early bailout checks for sessions that are already closed, rejected, or unregistered, which avoids sending reconnect messages for sessions that can no longer be recovered. The early -ESTALE and -ENOENT bailouts use a separate fail_return label that skips the pr_err_client diagnostic, since these codes indicate expected concurrent-teardown races rather than genuine reconnect build failures. Move the "reconnect start" log after the early-bailout checks so it only appears for sessions that actually proceed with reconnect. Save the prior session state before transitioning to RECONNECTING, and restore it in the failure path. Without this, a transient build or encoding failure (-ENOMEM, -ENOSPC) strands the session in RECONNECTING indefinitely because check_new_map() only retries sessions in RESTARTING state. Rewrite mds_peer_reset() to handle the case where the MDS is past its RECONNECT phase (i.e. active).
An active MDS rejects CLIENT_RECONNECT messages because it only accepts them during its own RECONNECT window after restart. Previously, the client would send a doomed reconnect that the MDS would reject or ignore. Now, the client tears the session down locally and lets new requests re-open a fresh session, which is the correct recovery for this scenario. The RECONNECTING state is handled on the same teardown path, since the MDS will reject reconnect attempts from an active client regardless of the session's local state. Add explicit cases for CLOSED and REJECTED session states in mds_peer_reset() since these are terminal states where a connection drop is expected behavior. The session teardown path in mds_peer_reset() follows the established drop-and-reacquire locking pattern from check_new_map(): take mdsc->mutex for session unregistration, release it, then take s->s_mutex separately for cleanup. This avoids introducing a new simultaneous lock nesting pattern. Log reconnect failures from check_new_map() and mds_peer_reset() at pr_warn level rather than pr_err, since return codes like -ESTALE (closed/rejected session) and -ENOENT (unregistered session) are expected during concurrent teardown. Log dropped messages for unregistered sessions via doutc() (dynamic debug) rather than pr_info, as post-reset message arrival is routine and does not warrant unconditional logging. 
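For readers who want the control flow without the locking, the behavior described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: every `model_*` / `MODEL_*` name below is a hypothetical stand-in rather than a kernel definition, only the relative ordering of the MDS states is meant to match, the pr_warn path for unexpected session states is folded into the teardown case, and all mutex handling, the unregistered-session (-ENOENT) check, and message building are omitted.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for MDS map states; only the ordering matters. */
enum model_mds_state {
	MODEL_MDS_REPLAY,
	MODEL_MDS_RECONNECT,
	MODEL_MDS_REJOIN,
	MODEL_MDS_ACTIVE,
};

/* Hypothetical stand-ins for client-side session states. */
enum model_session_state {
	MODEL_SESSION_OPENING,
	MODEL_SESSION_OPEN,
	MODEL_SESSION_HUNG,
	MODEL_SESSION_RESTARTING,
	MODEL_SESSION_RECONNECTING,
	MODEL_SESSION_CLOSING,
	MODEL_SESSION_CLOSED,
	MODEL_SESSION_REJECTED,
};

enum model_reset_action {
	MODEL_RESET_IGNORE,	/* fenced mount, or MDS not yet in RECONNECT */
	MODEL_RESET_RECONNECT,	/* MDS is inside its RECONNECT window */
	MODEL_RESET_TEARDOWN,	/* MDS past RECONNECT: drop session locally */
	MODEL_RESET_NOOP,	/* session already in a terminal state */
};

struct model_session {
	enum model_session_state state;
};

/* Decide what a peer reset should do, mirroring the order of checks. */
static enum model_reset_action
model_peer_reset_action(bool mount_fenced, enum model_mds_state mds,
			enum model_session_state session)
{
	if (mount_fenced || mds < MODEL_MDS_RECONNECT)
		return MODEL_RESET_IGNORE;
	if (mds == MODEL_MDS_RECONNECT)
		return MODEL_RESET_RECONNECT;

	switch (session) {
	case MODEL_SESSION_CLOSED:
	case MODEL_SESSION_REJECTED:
		return MODEL_RESET_NOOP;
	default:
		return MODEL_RESET_TEARDOWN;
	}
}

/*
 * Model of the save-and-restore rule: build_err stands in for a
 * transient build or encoding failure (0 means success).
 */
static int model_send_reconnect(struct model_session *s, int build_err)
{
	enum model_session_state old_state;

	/* Early bailout: terminal sessions are left untouched. */
	if (s->state == MODEL_SESSION_CLOSED ||
	    s->state == MODEL_SESSION_REJECTED)
		return -ESTALE;

	old_state = s->state;
	s->state = MODEL_SESSION_RECONNECTING;

	if (build_err) {
		/* Restore so a later map update can retry the reconnect. */
		s->state = old_state;
		return build_err;
	}
	return 0;
}
```

In this model, a failed attempt on a RESTARTING session leaves it in RESTARTING, which is the state the retry logic looks for. Without the restore step it would be left in RECONNECTING.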
Signed-off-by: Alex Markuze Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko --- fs/ceph/mds_client.c | 178 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 163 insertions(+), 15 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index d9543399b129..249419c17d3c 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -4470,9 +4470,14 @@ static void handle_session(struct ceph_mds_session *= session, break; =20 case CEPH_SESSION_REJECT: - WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING); - pr_info_client(cl, "mds%d rejected session\n", - session->s_mds); + WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING && + session->s_state !=3D CEPH_MDS_SESSION_RECONNECTING); + if (session->s_state =3D=3D CEPH_MDS_SESSION_RECONNECTING) + pr_info_client(cl, "mds%d reconnect rejected\n", + session->s_mds); + else + pr_info_client(cl, "mds%d rejected session\n", + session->s_mds); session->s_state =3D CEPH_MDS_SESSION_REJECTED; cleanup_session_requests(mdsc, session); remove_session_caps(session); @@ -4732,6 +4737,14 @@ static int reconnect_caps_cb(struct inode *inode, in= t mds, void *arg) cap->mseq =3D 0; /* and migrate_seq */ cap->cap_gen =3D atomic_read(&cap->session->s_cap_gen); =20 + /* + * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect. + * Instead, locks are submitted for best-effort MDS reclaim + * via the flock_len field below. If reclaim fails (e.g., + * another client grabbed a conflicting lock), future lock + * operations will fail and set the error flag at that point. + */ + /* These are lost when the session goes away */ if (S_ISDIR(inode->i_mode)) { if (cap->issued & CEPH_CAP_DIR_CREATE) { @@ -4946,20 +4959,19 @@ static int encode_snap_realms(struct ceph_mds_clien= t *mdsc, * * This is a relatively heavyweight operation, but it's rare. 
*/ -static void send_mds_reconnect(struct ceph_mds_client *mdsc, - struct ceph_mds_session *session) +static int send_mds_reconnect(struct ceph_mds_client *mdsc, + struct ceph_mds_session *session) { struct ceph_client *cl =3D mdsc->fsc->client; struct ceph_msg *reply; int mds =3D session->s_mds; int err =3D -ENOMEM; + int old_state; struct ceph_reconnect_state recon_state =3D { .session =3D session, }; LIST_HEAD(dispose); =20 - pr_info_client(cl, "mds%d reconnect start\n", mds); - recon_state.pagelist =3D ceph_pagelist_alloc(GFP_NOFS); if (!recon_state.pagelist) goto fail_nopagelist; @@ -4968,9 +4980,37 @@ static void send_mds_reconnect(struct ceph_mds_clien= t *mdsc, if (!reply) goto fail_nomsg; =20 + mutex_lock(&session->s_mutex); + + /* Serialized by s_mutex against concurrent ceph_get_deleg_ino(). */ xa_destroy(&session->s_delegated_inos); + if (session->s_state =3D=3D CEPH_MDS_SESSION_CLOSED || + session->s_state =3D=3D CEPH_MDS_SESSION_REJECTED) { + pr_info_client(cl, "mds%d skipping reconnect, session %s\n", + mds, + ceph_session_state_name(session->s_state)); + mutex_unlock(&session->s_mutex); + ceph_msg_put(reply); + err =3D -ESTALE; + goto fail_return; + } =20 - mutex_lock(&session->s_mutex); + /* s_mutex -> mdsc->mutex matches cleanup_session_requests() order. 
*/ + mutex_lock(&mdsc->mutex); + if (mds >=3D mdsc->max_sessions || mdsc->sessions[mds] !=3D session) { + mutex_unlock(&mdsc->mutex); + pr_info_client(cl, + "mds%d skipping reconnect, session unregistered\n", + mds); + mutex_unlock(&session->s_mutex); + ceph_msg_put(reply); + err =3D -ENOENT; + goto fail_return; + } + mutex_unlock(&mdsc->mutex); + + pr_info_client(cl, "mds%d reconnect start\n", mds); + old_state =3D session->s_state; session->s_state =3D CEPH_MDS_SESSION_RECONNECTING; session->s_seq =3D 0; =20 @@ -5100,7 +5140,7 @@ static void send_mds_reconnect(struct ceph_mds_client= *mdsc, =20 up_read(&mdsc->snap_rwsem); ceph_pagelist_release(recon_state.pagelist); - return; + return 0; =20 fail_clear_cap_reconnect: spin_lock(&session->s_cap_lock); @@ -5109,13 +5149,29 @@ static void send_mds_reconnect(struct ceph_mds_clie= nt *mdsc, fail: ceph_msg_put(reply); up_read(&mdsc->snap_rwsem); + /* + * Restore prior session state so map-driven reconnect logic + * (check_new_map) can retry. Without this, a transient build + * failure strands the session in RECONNECTING indefinitely. + */ + session->s_state =3D old_state; mutex_unlock(&session->s_mutex); fail_nomsg: ceph_pagelist_release(recon_state.pagelist); fail_nopagelist: pr_err_client(cl, "error %d preparing reconnect for mds%d\n", err, mds); - return; + return err; + +fail_return: + /* + * Early-exit path for expected concurrent-teardown races + * (-ESTALE for closed/rejected sessions, -ENOENT for + * unregistered sessions). Skip the pr_err_client diagnostic + * since these are not genuine reconnect build failures. 
+ */ + ceph_pagelist_release(recon_state.pagelist); + return err; } =20 =20 @@ -5196,9 +5252,15 @@ static void check_new_map(struct ceph_mds_client *md= sc, */ if (s->s_state =3D=3D CEPH_MDS_SESSION_RESTARTING && newstate >=3D CEPH_MDS_STATE_RECONNECT) { + int rc; + mutex_unlock(&mdsc->mutex); clear_bit(i, targets); - send_mds_reconnect(mdsc, s); + rc =3D send_mds_reconnect(mdsc, s); + if (rc) + pr_warn_client(cl, + "mds%d reconnect failed: %d\n", + i, rc); mutex_lock(&mdsc->mutex); } =20 @@ -5262,7 +5324,11 @@ static void check_new_map(struct ceph_mds_client *md= sc, } doutc(cl, "send reconnect to export target mds.%d\n", i); mutex_unlock(&mdsc->mutex); - send_mds_reconnect(mdsc, s); + err =3D send_mds_reconnect(mdsc, s); + if (err) + pr_warn_client(cl, + "mds%d export target reconnect failed: %d\n", + i, err); ceph_put_mds_session(s); mutex_lock(&mdsc->mutex); } @@ -6350,12 +6416,92 @@ static void mds_peer_reset(struct ceph_connection *= con) { struct ceph_mds_session *s =3D con->private; struct ceph_mds_client *mdsc =3D s->s_mdsc; + int session_state; =20 pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n", s->s_mds); - if (READ_ONCE(mdsc->fsc->mount_state) !=3D CEPH_MOUNT_FENCE_IO && - ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >=3D CEPH_MDS_STATE_REC= ONNECT) - send_mds_reconnect(mdsc, s); + + if (READ_ONCE(mdsc->fsc->mount_state) =3D=3D CEPH_MOUNT_FENCE_IO || + ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONN= ECT) + return; + + /* + * Only reconnect if MDS is in its RECONNECT phase. An MDS past + * RECONNECT (REJOIN, CLIENTREPLAY, ACTIVE) will reject reconnect + * attempts, so those states fall through to session teardown below. + */ + if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) =3D=3D CEPH_MDS_STATE_R= ECONNECT) { + int rc =3D send_mds_reconnect(mdsc, s); + + if (rc) + pr_warn_client(mdsc->fsc->client, + "mds%d reconnect failed: %d\n", + s->s_mds, rc); + return; + } + + /* + * MDS is active (past RECONNECT). 
It will not accept a + * CLIENT_RECONNECT from us, so tear the session down locally + * and let new requests re-open a fresh session. + * + * Snapshot session state with READ_ONCE, then revalidate under + * mdsc->mutex before acting. The subsequent mdsc->mutex + * section rechecks s_state to catch concurrent transitions, so + * the lockless snapshot here is safe. s->s_mutex is taken + * separately for cleanup after unregistration, which avoids + * introducing a new s->s_mutex + mdsc->mutex nesting. + */ + session_state =3D READ_ONCE(s->s_state); + + switch (session_state) { + case CEPH_MDS_SESSION_RESTARTING: + case CEPH_MDS_SESSION_RECONNECTING: + case CEPH_MDS_SESSION_CLOSING: + case CEPH_MDS_SESSION_OPEN: + case CEPH_MDS_SESSION_HUNG: + case CEPH_MDS_SESSION_OPENING: + mutex_lock(&mdsc->mutex); + if (s->s_mds >=3D mdsc->max_sessions || + mdsc->sessions[s->s_mds] !=3D s || + s->s_state !=3D session_state) { + pr_info_client(mdsc->fsc->client, + "mds%d state changed to %s during peer reset\n", + s->s_mds, + ceph_session_state_name(s->s_state)); + mutex_unlock(&mdsc->mutex); + return; + } + + ceph_get_mds_session(s); + s->s_state =3D CEPH_MDS_SESSION_CLOSED; + __unregister_session(mdsc, s); + __wake_requests(mdsc, &s->s_waiting); + mutex_unlock(&mdsc->mutex); + + mutex_lock(&s->s_mutex); + cleanup_session_requests(mdsc, s); + remove_session_caps(s); + mutex_unlock(&s->s_mutex); + + wake_up_all(&mdsc->session_close_wq); + + mutex_lock(&mdsc->mutex); + kick_requests(mdsc, s->s_mds); + mutex_unlock(&mdsc->mutex); + + ceph_put_mds_session(s); + break; + case CEPH_MDS_SESSION_CLOSED: + case CEPH_MDS_SESSION_REJECTED: + break; + default: + pr_warn_client(mdsc->fsc->client, + "mds%d peer reset in unexpected state %s\n", + s->s_mds, + ceph_session_state_name(session_state)); + break; + } } =20 static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg) @@ -6367,6 +6513,8 @@ static void mds_dispatch(struct ceph_connection *con,= struct ceph_msg *msg) =20 
mutex_lock(&mdsc->mutex); if (__verify_registered_session(mdsc, s) < 0) { + doutc(cl, "dropping tid %llu from unregistered session %d\n", + le64_to_cpu(msg->hdr.tid), s->s_mds); mutex_unlock(&mdsc->mutex); goto out; } --=20 2.34.1 From nobody Sat Jun 13 13:34:47 2026
From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v4 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Date: Thu, 7 May 2026 12:27:30 +0000 Message-Id: <20260507122737.2804094-5-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com> References: <20260507122737.2804094-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Convert wait_caps_flush() from a silent indefinite wait into a diagnostic wait loop that periodically dumps pending cap flush state. The underlying wait semantics remain intact: callers still wait until the requested cap flushes complete. The difference is that long stalls now produce actionable diagnostics instead of looking like a silent hang. CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES limits the number of entries emitted per diagnostic dump, and CEPH_CAP_FLUSH_MAX_DUMP_ITERS limits the number of timed diagnostic dumps before the wait continues silently. When more entries exist than the per-dump limit, a truncation count is reported. When the dump iteration limit is reached, a final suppression message is emitted so the transition to silence is explicit. The diagnostic dump collects flush entry data under cap_dirty_lock into a bounded on-stack array, then prints after releasing the lock. This avoids holding the spinlock across printk calls. A null cf->ci on the global flush list indicates a bug since all cap_flush entries are initialized with a valid ci before being added.
Signal this with WARN_ON_ONCE while still printing enough context for debugging. READ_ONCE is used for the i_last_cap_flush_ack field, which is read outside the inode lock domain. Flush tids are monotonically increasing and acks are processed in order under i_ceph_lock, so the latest ack tid is always the most recently written value. Add a ci pointer to struct ceph_cap_flush so that the diagnostic dump can identify which inode each pending flush belongs to. The new i_last_cap_flush_ack field tracks the latest acknowledged flush tid per inode for diagnostic correlation. This improves reset-drain observability and is also useful for existing sync and writeback troubleshooting paths. Signed-off-by: Alex Markuze Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko --- fs/ceph/caps.c | 10 +++++ fs/ceph/inode.c | 1 + fs/ceph/mds_client.c | 100 +++++++++++++++++++++++++++++++++++++++++-- fs/ceph/mds_client.h | 3 ++ fs/ceph/super.h | 6 +++ 5 files changed, 116 insertions(+), 4 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index cb9e78b713d9..4b37d9ffdf7f 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info= *ci, =20 spin_lock(&mdsc->cap_dirty_lock); capsnap->cap_flush.tid =3D ++mdsc->last_cap_flush_tid; + capsnap->cap_flush.ci =3D ci; list_add_tail(&capsnap->cap_flush.g_list, &mdsc->cap_flush_list); if (oldest_flush_tid =3D=3D 0) @@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void) return NULL; =20 cf->is_capsnap =3D false; + cf->ci =3D NULL; return cf; } =20 @@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode, doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode)); =20 swap(cf, ci->i_prealloc_cap_flush); + cf->ci =3D ci; cf->caps =3D flushing; cf->wake =3D wake; =20 @@ -3826,6 +3829,13 @@ static void handle_cap_flush_ack(struct inode *inode= , u64 flush_tid, bool wake_ci =3D false; bool wake_mdsc =3D false; =20 + /* + * Flush tids 
are monotonically increasing and acks arrive in + * order under i_ceph_lock, so this is always the latest tid. + * Diagnostic readers use READ_ONCE() without holding the lock. + */ + WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid); + list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) { /* Is this the one that was flushed? */ if (cf->tid =3D=3D flush_tid) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 4871d7ab2730..61d7c0b8161f 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -671,6 +671,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb) INIT_LIST_HEAD(&ci->i_cap_snaps); ci->i_head_snapc =3D NULL; ci->i_snap_caps =3D 0; + ci->i_last_cap_flush_ack =3D 0; =20 ci->i_last_rd =3D ci->i_last_wr =3D jiffies - 3600 * HZ; for (i =3D 0; i < CEPH_FILE_MODE_BITS; i++) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 249419c17d3c..6ab5031e697a 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2330,19 +2330,111 @@ static int check_caps_flush(struct ceph_mds_client= *mdsc, } =20 /* - * flush all dirty inode data to disk. + * Snapshot of a single cap_flush entry for diagnostic dump. + * Collected under cap_dirty_lock, printed after releasing it. + */ +struct flush_dump_entry { + u64 ino; /* inode number */ + u64 snap; /* snap id */ + int caps; /* dirty cap bits */ + u64 tid; /* flush transaction id */ + u64 last_ack; /* most recent ack tid for this inode */ + bool wake; /* whether completion was requested */ + bool is_capsnap; /* true if this is a cap snap flush */ + bool ci_null; /* true if cf->ci was unexpectedly NULL */ +}; + +/* + * Dump pending cap flushes for diagnostic purposes. * - * returns true if we've flushed through want_flush_tid + * cf->ci is safe to dereference here: cap_flush entries hold a + * reference on the inode (via the cap), and entries are removed from + * cap_flush_list under cap_dirty_lock before the cap (and thus the + * inode reference) is released. 
Holding cap_dirty_lock therefore + * guarantees the inode remains valid for the lifetime of the scan. + */ + +static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid) +{ + struct ceph_client *cl =3D mdsc->fsc->client; + struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES]; + struct ceph_cap_flush *cf; + int n =3D 0, remaining =3D 0; + + spin_lock(&mdsc->cap_dirty_lock); + list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) { + if (cf->tid > want_tid) + break; + if (n < CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES) { + struct flush_dump_entry *e =3D &entries[n++]; + + e->ci_null =3D WARN_ON_ONCE(!cf->ci); + if (!e->ci_null) { + e->ino =3D ceph_ino(&cf->ci->netfs.inode); + e->snap =3D ceph_snap(&cf->ci->netfs.inode); + e->last_ack =3D READ_ONCE(cf->ci->i_last_cap_flush_ack); + } + e->caps =3D cf->caps; + e->tid =3D cf->tid; + e->wake =3D cf->wake; + e->is_capsnap =3D cf->is_capsnap; + } else { + remaining++; + } + } + spin_unlock(&mdsc->cap_dirty_lock); + + pr_info_client(cl, "still waiting for cap flushes through %llu:\n", + want_tid); + for (int i =3D 0; i < n; i++) { + struct flush_dump_entry *e =3D &entries[i]; + + if (e->ci_null) + pr_info_client(cl, + " (null ci) %s tid=3D%llu wake=3D%d%s\n", + ceph_cap_string(e->caps), e->tid, + e->wake, + e->is_capsnap ? " is_capsnap" : ""); + else + pr_info_client(cl, + " %llx.%llx %s tid=3D%llu last_ack=3D%llu wake=3D%d%s\n", + e->ino, e->snap, + ceph_cap_string(e->caps), e->tid, + e->last_ack, e->wake, + e->is_capsnap ? " is_capsnap" : ""); + } + if (remaining) + pr_info_client(cl, " ... and %d more pending flushes\n", + remaining); +} + +/* + * Wait for all cap flushes through @want_flush_tid to complete. + * Periodically dumps pending cap flush state for diagnostics. 
*/ static void wait_caps_flush(struct ceph_mds_client *mdsc, u64 want_flush_tid) { struct ceph_client *cl =3D mdsc->fsc->client; + int i =3D 0; + long ret; =20 doutc(cl, "want %llu\n", want_flush_tid); =20 - wait_event(mdsc->cap_flushing_wq, - check_caps_flush(mdsc, want_flush_tid)); + do { + /* 60 * HZ fits in a long on all supported architectures. */ + ret =3D wait_event_timeout(mdsc->cap_flushing_wq, + check_caps_flush(mdsc, want_flush_tid), + CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ); + if (ret =3D=3D 0) { + if (i < CEPH_CAP_FLUSH_MAX_DUMP_ITERS) + dump_cap_flushes(mdsc, want_flush_tid); + else if (i =3D=3D CEPH_CAP_FLUSH_MAX_DUMP_ITERS) + pr_info_client(cl, + "still waiting for cap flushes; suppressing further dumps\n"); + i++; + } + } while (ret =3D=3D 0); =20 doutc(cl, "ok, flushed thru %llu\n", want_flush_tid); } diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h index d873e784b025..8208fdf02efe 100644 --- a/fs/ceph/mds_client.h +++ b/fs/ceph/mds_client.h @@ -77,6 +77,9 @@ struct ceph_fs_client; struct ceph_cap; =20 #define MDS_AUTH_UID_ANY -1 +#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60 +#define CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES 5 +#define CEPH_CAP_FLUSH_MAX_DUMP_ITERS 5 =20 struct ceph_mds_cap_match { s64 uid; /* default to MDS_AUTH_UID_ANY */ diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 8afc6f3a10da..a4993644d543 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -239,6 +239,7 @@ struct ceph_cap_flush { bool is_capsnap; /* true means capsnap */ struct list_head g_list; // global struct list_head i_list; // per inode + struct ceph_inode_info *ci; }; =20 /* @@ -453,6 +454,11 @@ struct ceph_inode_info { struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or dirty|flushing caps */ unsigned i_snap_caps; /* cap bits for snapped files */ + /* + * Written under i_ceph_lock, read via READ_ONCE() + * from diagnostic paths. 
+ */ + u64 i_last_cap_flush_ack; =20 unsigned long i_last_rd; unsigned long i_last_wr; --=20 2.34.1 From nobody Sat Jun 13 13:34:47 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v4 05/11] ceph: add client reset state machine and session teardown
Date: Thu, 7 May 2026 12:27:31 +0000
Message-Id: <20260507122737.2804094-6-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the client-side reset state machine, request gating, and manual
session teardown implementation.

Manual reset is an operator-triggered escape hatch for client/MDS
stalemates in which caps, locks, or unsafe metadata state stop making
forward progress. The reset blocks new metadata work, attempts a
bounded best-effort drain of dirty client state while sessions are
still alive, and finally asks the MDS to close sessions before tearing
local session state down directly.

The reset state machine tracks four phases:
IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE. QUIESCING is set
synchronously by schedule_reset() before the workqueue item is
dispatched, so that new metadata requests and file-lock acquisitions
are gated immediately -- even before the work function begins running.
All non-IDLE phases block callers on blocked_wq, preventing races with
session teardown.

The drain phase flushes mdlog state, dirty caps, and pending cap
releases for a bounded interval.
State that still cannot make progress within that interval is
discarded during teardown, which is the point of the reset: break the
stalemate and allow fresh sessions to rebuild clean state.

The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then
clean up caps and requests under s->s_mutex. Reconnect is not
attempted because the MDS only accepts reconnects during its own
RECONNECT phase after restart, not from an active client.

Blocked callers are released when reset completes and observe the
final result as -EAGAIN (reset failed) or 0 (success). A caller whose
own wait expires or is interrupted by a signal gets -ETIMEDOUT or
-ERESTARTSYS instead. Internal work-function errors such as -ENOMEM
are not propagated to unrelated callers like open() or flock(); the
detailed error remains in debugfs and tracepoints.

The work function checks st->shutdown before each phase transition
(DRAINING, TEARDOWN) so that state set by a concurrent
ceph_mdsc_destroy() is not overwritten. If destroy already took
ownership, the work function releases session references and returns
without touching the state.

The timeout calculation for blocked-request waiters uses max_t() to
keep the remaining wait at a minimum of one jiffy, which prevents an
unsigned jiffies underflow when the deadline has already passed. The
close-grace sleep before teardown is a best-effort nudge to let queued
REQUEST_CLOSE messages egress; it is not a correctness requirement
since the MDS still has session_autoclose as a fallback.

The destroy path marks reset as failed and wakes blocked waiters
before cancel_work_sync() so unmount does not stall.
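
For readers who want the phase rules in one place, the following is a
minimal userspace sketch of the behaviour described above, not the
kernel code itself. It models only the phase transitions, the shutdown
ownership check, the -EAGAIN mapping seen by gated callers, and the
clamped timeout. All names (reset_model, model_schedule() and so on)
are hypothetical and exist only for illustration; locking, the
waitqueue, and the workqueue are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the four phases named in the commit message. */
enum reset_phase {
	RESET_IDLE,
	RESET_QUIESCING,
	RESET_DRAINING,
	RESET_TEARDOWN,
};

struct reset_model {
	enum reset_phase phase;
	int last_errno;		/* detailed result, kept for diagnostics */
	bool shutdown;		/* set once the destroy path owns the state */
};

/* IDLE -> QUIESCING; a second reset while one is active is refused. */
static int model_schedule(struct reset_model *st)
{
	if (st->phase != RESET_IDLE)
		return -EBUSY;
	st->phase = RESET_QUIESCING;
	st->last_errno = 0;
	return 0;
}

/* Move to DRAINING or TEARDOWN unless destroy already took ownership. */
static bool model_advance(struct reset_model *st, enum reset_phase next)
{
	if (st->shutdown)
		return false;
	st->phase = next;
	return true;
}

/* Record the result and return to IDLE; a no-op after shutdown. */
static void model_complete(struct reset_model *st, int err)
{
	if (st->shutdown)
		return;
	st->last_errno = err;
	st->phase = RESET_IDLE;
}

/* Destroy path: mark an in-flight reset as failed and release callers. */
static void model_destroy(struct reset_model *st)
{
	st->shutdown = true;
	if (st->phase != RESET_IDLE) {
		st->phase = RESET_IDLE;
		st->last_errno = -ESHUTDOWN;
	}
}

/* What a gated caller observes once the phase is IDLE again. */
static int model_caller_result(const struct reset_model *st)
{
	assert(st->phase == RESET_IDLE);
	return st->last_errno ? -EAGAIN : 0;
}

/*
 * Clamp the remaining wait to at least one tick. The unsigned
 * subtraction wraps when the deadline has passed; reading it as a
 * signed value turns that into a negative number, which the clamp
 * replaces with 1 (relies on the usual two's-complement conversion).
 */
static long model_remaining(unsigned long deadline, unsigned long now)
{
	long left = (long)(deadline - now);

	return left > 1 ? left : 1;
}
```

In this model a reset that fails with -ENOMEM still leaves callers
with -EAGAIN, and a model_complete() that races with model_destroy()
leaves the -ESHUTDOWN result in place.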
Signed-off-by: Alex Markuze Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko --- fs/ceph/locks.c | 16 ++ fs/ceph/mds_client.c | 508 +++++++++++++++++++++++++++++++++++++++++++ fs/ceph/mds_client.h | 46 ++++ 3 files changed, 570 insertions(+) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index c4ff2266bb94..677221bd64e0 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_l= ock *fl) { struct inode *inode =3D file_inode(file); struct ceph_inode_info *ci =3D ceph_inode(inode); + struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb); struct ceph_client *cl =3D ceph_inode_to_client(inode); int err =3D 0; u16 op =3D CEPH_MDS_OP_SETFILELOCK; @@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_= lock *fl) return -EIO; } =20 + /* Wait for reset to complete before acquiring new locks */ + if (op =3D=3D CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) { + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) + return err; + } + if (lock_is_read(fl)) lock_cmd =3D CEPH_LOCK_SHARED; else if (lock_is_write(fl)) @@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_= lock *fl) { struct inode *inode =3D file_inode(file); struct ceph_inode_info *ci =3D ceph_inode(inode); + struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb); struct ceph_client *cl =3D ceph_inode_to_client(inode); int err =3D 0; u8 wait =3D 0; @@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file= _lock *fl) return -EIO; } =20 + /* Wait for reset to complete before acquiring new locks */ + if (!lock_is_unlock(fl)) { + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) + return err; + } + if (IS_SETLKW(cmd)) wait =3D 1; =20 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 6ab5031e697a..ce773b1095da 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ 
-65,6 +66,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc, struct list_head *head); static void ceph_cap_release_work(struct work_struct *work); static void ceph_cap_reclaim_work(struct work_struct *work); +static void ceph_mdsc_reset_workfn(struct work_struct *work); =20 static const struct ceph_connection_operations mds_con_ops; =20 @@ -3844,6 +3846,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client = *mdsc, struct inode *dir, struct ceph_client *cl =3D mdsc->fsc->client; int err =3D 0; =20 + /* + * If a reset is in progress, wait for it to complete. + * + * This is best-effort: a request can pass this check just + * before the phase leaves IDLE and proceed concurrently with + * reset. That is acceptable because (a) such requests will + * either complete normally or fail and be retried by the + * caller, and (b) adding lock serialization here would + * penalize every request for a rare manual operation. + */ + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) { + doutc(cl, "wait_for_reset failed: %d\n", err); + return err; + } + /* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */ if (req->r_inode) ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN); @@ -5266,6 +5284,474 @@ static int send_mds_reconnect(struct ceph_mds_clien= t *mdsc, return err; } =20 +const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase) +{ + switch (phase) { + case CEPH_CLIENT_RESET_IDLE: return "idle"; + case CEPH_CLIENT_RESET_QUIESCING: return "quiescing"; + case CEPH_CLIENT_RESET_DRAINING: return "draining"; + case CEPH_CLIENT_RESET_TEARDOWN: return "teardown"; + default: return "unknown"; + } +} + +/** + * ceph_mdsc_wait_for_reset - wait for an active reset to complete + * @mdsc: MDS client + * + * Returns 0 if reset completed successfully or no reset was active. + * Returns -EAGAIN if reset completed with an error, signalling the + * caller to retry. The internal error (e.g. 
-ENOMEM) is not propagated + * because callers like open() or flock() have no way to act on + * work-function internals. The detailed error is available via debugfs + * reset/status and tracepoints. + * Returns -ETIMEDOUT if we timed out waiting. + * Returns -ERESTARTSYS if interrupted by signal. + */ +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_client *cl =3D mdsc->fsc->client; + unsigned long deadline =3D jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC *= HZ; + int blocked_count; + long remaining; + long wait_ret; + int ret; + + if (ceph_reset_is_idle(st)) + return 0; + + blocked_count =3D atomic_inc_return(&st->blocked_requests); + doutc(cl, "request blocked during reset, %d total blocked\n", + blocked_count); + +retry: + remaining =3D max_t(long, deadline - jiffies, 1); + wait_ret =3D wait_event_interruptible_timeout(st->blocked_wq, + ceph_reset_is_idle(st), + remaining); + + if (wait_ret =3D=3D 0) { + atomic_dec(&st->blocked_requests); + pr_warn_client(cl, "timed out waiting for reset to complete\n"); + return -ETIMEDOUT; + } + if (wait_ret < 0) { + atomic_dec(&st->blocked_requests); + return (int)wait_ret; /* -ERESTARTSYS */ + } + + /* + * Verify phase is still IDLE under the lock. If another reset + * was scheduled between the wake-up and this check, loop back + * and wait for it to finish rather than returning a stale result. + */ + spin_lock(&st->lock); + if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) { + spin_unlock(&st->lock); + if (time_before(jiffies, deadline)) + goto retry; + atomic_dec(&st->blocked_requests); + return -ETIMEDOUT; + } + ret =3D st->last_errno; + spin_unlock(&st->lock); + + atomic_dec(&st->blocked_requests); + return ret ? 
-EAGAIN : 0; +} + +static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + + spin_lock(&st->lock); + /* + * If destroy already marked us as shut down, it owns the + * final bookkeeping and waiter wakeup. Just bail so we + * don't overwrite its state. + */ + if (st->shutdown) { + spin_unlock(&st->lock); + return; + } + st->last_finish =3D jiffies; + st->last_errno =3D ret; + st->phase =3D CEPH_CLIENT_RESET_IDLE; + if (ret) + st->failure_count++; + else + st->success_count++; + spin_unlock(&st->lock); + + /* Wake up all requests that were blocked waiting for reset */ + wake_up_all(&st->blocked_wq); + +} + +static void ceph_mdsc_reset_workfn(struct work_struct *work) +{ + struct ceph_mds_client *mdsc =3D + container_of(work, struct ceph_mds_client, reset_work); + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_client *cl =3D mdsc->fsc->client; + struct ceph_mds_session **sessions =3D NULL; + char reason[CEPH_CLIENT_RESET_REASON_LEN]; + unsigned long drain_deadline; + int max_sessions, i, n =3D 0, torn_down =3D 0; + int ret =3D 0; + + spin_lock(&st->lock); + strscpy(reason, st->last_reason, sizeof(reason)); + spin_unlock(&st->lock); + + mutex_lock(&mdsc->mutex); + max_sessions =3D mdsc->max_sessions; + if (max_sessions <=3D 0) { + mutex_unlock(&mdsc->mutex); + goto out_complete; + } + + sessions =3D kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL); + if (!sessions) { + mutex_unlock(&mdsc->mutex); + ret =3D -ENOMEM; + pr_err_client(cl, + "manual session reset failed to allocate session array\n"); + ceph_mdsc_reset_complete(mdsc, ret); + return; + } + + for (i =3D 0; i < max_sessions; i++) { + struct ceph_mds_session *session =3D mdsc->sessions[i]; + + if (!session) + continue; + + /* + * Read session state without s_mutex to avoid nesting + * mdsc->mutex -> s_mutex, which would invert the + * s_mutex -> mdsc->mutex order used by + * 
cleanup_session_requests(). s_state is an int + * so loads are atomic; the teardown loop below + * handles races with concurrent state transitions. + */ + switch (READ_ONCE(session->s_state)) { + case CEPH_MDS_SESSION_OPEN: + case CEPH_MDS_SESSION_HUNG: + case CEPH_MDS_SESSION_OPENING: + case CEPH_MDS_SESSION_RESTARTING: + case CEPH_MDS_SESSION_RECONNECTING: + case CEPH_MDS_SESSION_CLOSING: + sessions[n++] =3D ceph_get_mds_session(session); + break; + default: + pr_info_client(cl, + "mds%d in state %s, skipping reset\n", + session->s_mds, + ceph_session_state_name(session->s_state)); + break; + } + } + mutex_unlock(&mdsc->mutex); + + pr_info_client(cl, + "manual session reset executing (sessions=3D%d, reason=3D\"%s\")\= n", + n, reason); + + if (n =3D=3D 0) { + kfree(sessions); + goto out_complete; + } + + spin_lock(&st->lock); + if (st->shutdown) { + spin_unlock(&st->lock); + goto out_sessions; + } + st->phase =3D CEPH_CLIENT_RESET_DRAINING; + spin_unlock(&st->lock); + + /* + * Best-effort drain: flush dirty state while sessions are still + * alive. New requests are blocked while phase !=3D IDLE. + * The sessions are functional, so non-stuck state drains normally. + * Stuck state (the cause of the stalemate the operator is trying + * to break) will not drain -- that is expected, and we proceed to + * forced teardown after the timeout. + * + * Four things are drained: + * 1. MDS journal -- send_flush_mdlog asks each MDS to journal + * pending unsafe operations (creates, renames, setattrs). + * 2. Unsafe requests -- bounded wait for each unsafe write + * request to reach safe status via r_safe_completion. + * 3. Dirty caps -- ceph_flush_dirty_caps triggers cap flush on + * all sessions. Non-stuck caps flush in milliseconds. + * 4. Cap releases -- push pending cap release messages. + * + * The unsafe-request wait and cap-flush wait below provide + * the bounded drain window during which all categories can + * make progress. 
+ */ + for (i =3D 0; i < n; i++) + send_flush_mdlog(sessions[i]); + + /* + * Both drain legs (unsafe requests and cap flushes) share a + * single deadline so the total drain time is bounded at + * CEPH_CLIENT_RESET_DRAIN_SEC. + */ + drain_deadline =3D jiffies + CEPH_CLIENT_RESET_DRAIN_SEC * HZ; + + /* + * Wait for unsafe write requests (creates, renames, setattrs) + * to reach safe status. Uses the same pattern as + * flush_mdlog_and_wait_mdsc_unsafe_requests() but bounded by + * the shared drain deadline. Requests that do not complete within + * the window are force-dropped during teardown. + */ + { + struct ceph_mds_request *req; + struct rb_node *rn; + u64 last_tid; + + mutex_lock(&mdsc->mutex); + last_tid =3D mdsc->last_tid; + mutex_unlock(&mdsc->mutex); + + mutex_lock(&mdsc->mutex); + rn =3D rb_first(&mdsc->request_tree); + while (rn) { + req =3D rb_entry(rn, struct ceph_mds_request, r_node); + if (req->r_tid > last_tid) + break; + if (req->r_op =3D=3D CEPH_MDS_OP_SETFILELOCK || + !(req->r_op & CEPH_MDS_OP_WRITE)) { + rn =3D rb_next(rn); + continue; + } + ceph_mdsc_get_request(req); + mutex_unlock(&mdsc->mutex); + + wait_for_completion_timeout(&req->r_safe_completion, + max_t(long, drain_deadline - jiffies, 1)); + + mutex_lock(&mdsc->mutex); + ceph_mdsc_put_request(req); + if (time_after(jiffies, drain_deadline)) + break; + rn =3D rb_first(&mdsc->request_tree); + } + mutex_unlock(&mdsc->mutex); + + if (time_after_eq(jiffies, drain_deadline)) + WRITE_ONCE(st->drain_timed_out, true); + } + + ceph_flush_dirty_caps(mdsc); + ceph_flush_cap_releases(mdsc); + + spin_lock(&mdsc->cap_dirty_lock); + if (!list_empty(&mdsc->cap_flush_list)) { + struct ceph_cap_flush *cf =3D + list_last_entry(&mdsc->cap_flush_list, + struct ceph_cap_flush, g_list); + u64 want_flush =3D mdsc->last_cap_flush_tid; + long drain_ret; + + /* + * Setting wake on the last entry is sufficient: flush + * entries complete in order, so when this entry finishes + * all earlier ones are already done. 
+ */ + cf->wake =3D true; + spin_unlock(&mdsc->cap_dirty_lock); + pr_info_client(cl, + "draining (want_flush=3D%llu, %d sessions)\n", + want_flush, n); + drain_ret =3D wait_event_timeout(mdsc->cap_flushing_wq, + check_caps_flush(mdsc, + want_flush), + max_t(long, + drain_deadline - jiffies, + 1)); + if (drain_ret =3D=3D 0) { + pr_info_client(cl, + "drain timed out, proceeding with forced teardown\n"); + WRITE_ONCE(st->drain_timed_out, true); + } else { + pr_info_client(cl, "drain completed successfully\n"); + } + } else { + spin_unlock(&mdsc->cap_dirty_lock); + } + + spin_lock(&st->lock); + if (st->shutdown) { + spin_unlock(&st->lock); + goto out_sessions; + } + st->phase =3D CEPH_CLIENT_RESET_TEARDOWN; + spin_unlock(&st->lock); + + /* + * Ask each MDS to close the session before we tear it down + * locally. Without this the MDS sees only a connection drop and + * waits for the client to reconnect (up to session_autoclose + * seconds) before evicting the session and releasing locks. + * + * Reuse the normal close machinery so the session state/sequence + * snapshot is serialized under s_mutex and a racing s_seq bump + * retransmits REQUEST_CLOSE while the session remains CLOSING. + * We send all close requests first, then yield briefly to let the + * network stack transmit them before __unregister_session() + * closes the connections. + */ + for (i =3D 0; i < n; i++) { + int err; + + mutex_lock(&sessions[i]->s_mutex); + err =3D __close_session(mdsc, sessions[i]); + mutex_unlock(&sessions[i]->s_mutex); + if (err < 0) + pr_warn_client(cl, + "mds%d failed to queue close request before reset: %d\n", + sessions[i]->s_mds, err); + } + /* + * Best-effort grace period: yield briefly so the network stack + * can transmit the queued REQUEST_CLOSE messages before we tear + * down connections. Not a correctness requirement -- the MDS + * will still evict via session_autoclose if it never receives + * the close request. 
+ * + * Event-based waiting is not viable here: there is no completion + * event for "message left the NIC," and waiting for the MDS + * SESSION_CLOSE response would re-create the stalemate that the + * reset is meant to break. + */ + if (n > 0) + msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS); + + /* + * Tear down each session: close the connection, remove all + * caps, clean up requests, then kick pending requests so they + * re-open a fresh session on the next attempt. + * + * This is modeled on the check_new_map() forced-close path + * for stopped MDS ranks - a proven pattern for hard session + * teardown. We do NOT attempt send_mds_reconnect() because + * the MDS only accepts reconnects during its own RECONNECT + * phase (after MDS restart), not from an active client. + * + * Any state that did not drain (caps that didn't flush, unsafe + * requests that the MDS didn't journal) is force-dropped here. + * This is intentional: that state is stuck and is the reason + * the operator triggered the reset. 
+ */ + for (i =3D 0; i < n; i++) { + int mds =3D sessions[i]->s_mds; + + pr_info_client(cl, "mds%d resetting session\n", mds); + + mutex_lock(&mdsc->mutex); + if (mds >=3D mdsc->max_sessions || + mdsc->sessions[mds] !=3D sessions[i]) { + pr_info_client(cl, + "mds%d session already torn down, skipping\n", + mds); + mutex_unlock(&mdsc->mutex); + ceph_put_mds_session(sessions[i]); + sessions[i] =3D NULL; + continue; + } + sessions[i]->s_state =3D CEPH_MDS_SESSION_CLOSED; + __unregister_session(mdsc, sessions[i]); + __wake_requests(mdsc, &sessions[i]->s_waiting); + mutex_unlock(&mdsc->mutex); + + mutex_lock(&sessions[i]->s_mutex); + cleanup_session_requests(mdsc, sessions[i]); + remove_session_caps(sessions[i]); + mutex_unlock(&sessions[i]->s_mutex); + + wake_up_all(&mdsc->session_close_wq); + + ceph_put_mds_session(sessions[i]); + + mutex_lock(&mdsc->mutex); + kick_requests(mdsc, mds); + mutex_unlock(&mdsc->mutex); + + torn_down++; + pr_info_client(cl, "mds%d session reset complete\n", mds); + } + + kfree(sessions); + + spin_lock(&st->lock); + st->sessions_reset =3D torn_down; + spin_unlock(&st->lock); + +out_complete: + ceph_mdsc_reset_complete(mdsc, ret); + return; + +out_sessions: + /* shutdown =3D=3D true: ceph_mdsc_destroy() owns the final transition. */ + for (i =3D 0; i < n; i++) + ceph_put_mds_session(sessions[i]); + kfree(sessions); +} + +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc, + const char *reason) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_fs_client *fsc =3D mdsc->fsc; + const char *msg =3D (reason && reason[0]) ? 
reason : "manual"; + int mount_state; + + mount_state =3D READ_ONCE(fsc->mount_state); + if (mount_state !=3D CEPH_MOUNT_MOUNTED) { + pr_warn_client(fsc->client, + "reset rejected: mount_state=3D%d (not mounted)\n", + mount_state); + return -EINVAL; + } + + spin_lock(&st->lock); + if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) { + spin_unlock(&st->lock); + return -EBUSY; + } + + st->phase =3D CEPH_CLIENT_RESET_QUIESCING; + st->last_start =3D jiffies; + st->last_errno =3D 0; + st->drain_timed_out =3D false; + st->sessions_reset =3D 0; + st->trigger_count++; + strscpy(st->last_reason, msg, sizeof(st->last_reason)); + spin_unlock(&st->lock); + + if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) { + spin_lock(&st->lock); + st->phase =3D CEPH_CLIENT_RESET_IDLE; + st->last_errno =3D -EALREADY; + st->last_finish =3D jiffies; + st->failure_count++; + spin_unlock(&st->lock); + wake_up_all(&st->blocked_wq); + return -EALREADY; + } + + pr_info_client(mdsc->fsc->client, + "manual session reset scheduled (reason=3D\"%s\")\n", + msg); + return 0; +} + =20 /* * compare old and new mdsmaps, kicking requests @@ -5811,6 +6297,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc) INIT_LIST_HEAD(&mdsc->dentry_leases); INIT_LIST_HEAD(&mdsc->dentry_dir_leases); =20 + spin_lock_init(&mdsc->reset_state.lock); + init_waitqueue_head(&mdsc->reset_state.blocked_wq); + atomic_set(&mdsc->reset_state.blocked_requests, 0); + INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn); + ceph_caps_init(mdsc); ceph_adjust_caps_max_min(mdsc, fsc->mount_options); =20 @@ -6336,6 +6827,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc) /* flush out any connection work with references to us */ ceph_msgr_flush(); =20 + /* + * Mark reset as failed and wake any blocked waiters before + * cancelling, so unmount doesn't stall on blocked_wq timeout + * if cancel_work_sync() prevents the work from running. 
+ */ + spin_lock(&mdsc->reset_state.lock); + mdsc->reset_state.shutdown =3D true; + if (mdsc->reset_state.phase !=3D CEPH_CLIENT_RESET_IDLE) { + mdsc->reset_state.phase =3D CEPH_CLIENT_RESET_IDLE; + mdsc->reset_state.last_errno =3D -ESHUTDOWN; + mdsc->reset_state.last_finish =3D jiffies; + mdsc->reset_state.failure_count++; + } + spin_unlock(&mdsc->reset_state.lock); + wake_up_all(&mdsc->reset_state.blocked_wq); + + cancel_work_sync(&mdsc->reset_work); ceph_mdsc_stop(mdsc); =20 ceph_metric_destroy(&mdsc->metric); diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h index 8208fdf02efe..b1a0621cd37e 100644 --- a/fs/ceph/mds_client.h +++ b/fs/ceph/mds_client.h @@ -80,7 +80,47 @@ struct ceph_cap; #define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60 #define CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES 5 #define CEPH_CAP_FLUSH_MAX_DUMP_ITERS 5 +#define CEPH_CLIENT_RESET_REASON_LEN 64 +#define CEPH_CLIENT_RESET_DRAIN_SEC 30 +#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100 +#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120 =20 +enum ceph_client_reset_phase { + CEPH_CLIENT_RESET_IDLE =3D 0, + /* + * QUIESCING is set synchronously by schedule_reset() before the + * workqueue item is dispatched. It gates new requests (any + * phase !=3D IDLE blocks callers) during the window between + * scheduling and the work function's transition to DRAINING. 
+ */ + CEPH_CLIENT_RESET_QUIESCING, + CEPH_CLIENT_RESET_DRAINING, + CEPH_CLIENT_RESET_TEARDOWN, +}; + +struct ceph_client_reset_state { + spinlock_t lock; /* protects all fields below */ + u64 trigger_count; /* number of resets triggered */ + u64 success_count; /* number of successful resets */ + u64 failure_count; /* number of failed resets */ + unsigned long last_start; /* jiffies when last reset started */ + unsigned long last_finish; /* jiffies when last reset finished */ + int last_errno; /* result of most recent reset */ + enum ceph_client_reset_phase phase; /* current reset phase */ + bool drain_timed_out; /* drain exceeded timeout */ + bool shutdown; /* destroy in progress */ + int sessions_reset; /* sessions torn down in last reset */ + char last_reason[CEPH_CLIENT_RESET_REASON_LEN]; /* operator-supplied reas= on */ + + /* Request blocking during reset */ + wait_queue_head_t blocked_wq; /* waitqueue for blocked callers */ + atomic_t blocked_requests; /* count of blocked callers */ +}; + +static inline bool ceph_reset_is_idle(struct ceph_client_reset_state *st) +{ + return READ_ONCE(st->phase) =3D=3D CEPH_CLIENT_RESET_IDLE; +} struct ceph_mds_cap_match { s64 uid; /* default to MDS_AUTH_UID_ANY */ u32 num_gids; @@ -543,6 +583,8 @@ struct ceph_mds_client { struct list_head dentry_dir_leases; /* lru list */ =20 struct ceph_client_metric metric; + struct work_struct reset_work; + struct ceph_client_reset_state reset_state; struct ceph_subvolume_metrics_tracker subvol_metrics; =20 /* Subvolume metrics send tracking */ @@ -574,10 +616,14 @@ extern struct ceph_mds_session * __ceph_lookup_mds_session(struct ceph_mds_client *, int mds); =20 extern const char *ceph_session_state_name(int s); +extern const char *ceph_reset_phase_name(enum ceph_client_reset_phase phas= e); =20 extern struct ceph_mds_session * ceph_get_mds_session(struct ceph_mds_session *s); extern void ceph_put_mds_session(struct ceph_mds_session *s); +int ceph_mdsc_schedule_reset(struct 
ceph_mds_client *mdsc, + const char *reason); +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc); =20 extern int ceph_mdsc_init(struct ceph_fs_client *fsc); extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc); --=20 2.34.1 From nobody Sat Jun 13 13:34:47 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 54ADC3C1994 for ; Thu, 7 May 2026 12:27:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778156880; cv=none; b=mWyrylitM2msjnR65AUcR3JYeMNlJqDxRnnv7/FFmSXPZQ+yzk6h5tTawKgSIOSAEgT0X0YEp2IaI9+HbF+OBV4s8qMbyFJ6+cU5wDwVrRTWjzbHC13RaIPYPAisiGYh95pP884mrbXokcH0Lx9lBU6d17iNwWA8Wt17+H5juBc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778156880; c=relaxed/simple; bh=mry7fPSwDpLlrYEx/LJ8bD4menn0NAgWZBGmTokNjUw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=iM91IGbP41W3p2IAEvSYjW1Cqma3ceUO/j6M46LZBQy1X5YXp9WlaUMI8wviyHSfZYpWJjWlHEwg1e7X74kf9QEDe1m1151bCiqS160jzAkxWtYKEm/fWAdgM72nVVKT316t+MFDnYbyKiMSH35nChBc0I0gutXbdrfcMz/BLWo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Y5EGiFik; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b=KR8f4Wbk; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) 
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v4 06/11] ceph: add manual reset debugfs control and tracepoints
Date: Thu, 7 May 2026 12:27:32 +0000
Message-Id: <20260507122737.2804094-7-amarkuze@redhat.com>
In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>

Add the debugfs and trace plumbing used to trigger and observe manual
client reset.

The reset interface exposes a trigger file for operator-initiated reset
and a status file for tracking the most recent run. The tracepoints
record scheduling, completion, and blocked caller behavior so reset
progress can be diagnosed from the client side.

debugfs layout under /sys/kernel/debug/ceph//reset/:
  trigger - write to initiate a manual reset
  status  - read to see the most recent reset result

The reset directory is cleaned up via debugfs_remove_recursive() on the
parent, so individual file dentries are not stored.

Tracepoints:
  ceph_client_reset_schedule  - reset queued
  ceph_client_reset_complete  - reset finished (success or failure)
  ceph_client_reset_blocked   - caller blocked waiting for reset
  ceph_client_reset_unblocked - caller unblocked after reset

All tracepoints use a null-safe access for monc.auth->global_id to
guard against early-init or late-teardown edge cases.
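As an illustration of how the status file described above could be consumed from userspace, here is a minimal Python sketch. The field names follow the "key: value" lines printed by reset_status_show() in this patch; the sample values, the helper names parse_reset_status() and reset_is_settled(), and the choice of "settled" criteria are assumptions made for the example, not part of the series.

```python
# Hypothetical sample of what reading reset/status might return. The field
# names follow reset_status_show(); every value here is invented.
SAMPLE_STATUS = """\
phase: idle
trigger_count: 3
success_count: 3
failure_count: 0
last_start_ms_ago: 1520
last_finish_ms_ago: 1310
last_errno: 0
last_reason: (none)
drain_timed_out: no
sessions_reset: 2
blocked_requests: 0
"""


def parse_reset_status(text):
    """Split "key: value" lines into a dict, skipping malformed lines."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        status[key.strip()] = value.strip()
    return status


def reset_is_settled(status):
    """True when the last reset ended cleanly and no caller is still blocked.

    The criteria (idle phase, zero errno, zero blocked requests) are an
    assumption for this sketch.
    """
    return (
        status.get("phase", "").lower() == "idle"
        and status.get("last_errno") == "0"
        and status.get("blocked_requests") == "0"
    )


status = parse_reset_status(SAMPLE_STATUS)
print(reset_is_settled(status))
```

In a real run the text would come from the status file under the reset directory rather than from an embedded string; the values are kept as strings here so the parser stays independent of which fields are numeric.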
Signed-off-by: Alex Markuze
Reviewed-by: Viacheslav Dubeyko
Tested-by: Viacheslav Dubeyko
---
 fs/ceph/debugfs.c           | 103 ++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.c        |   7 +++
 fs/ceph/super.h             |   1 +
 include/trace/events/ceph.h |  67 +++++++++++++++++++++++
 4 files changed, 178 insertions(+)

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index e2463f93cf6b..18eb5da03411 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -392,6 +393,90 @@ static int status_show(struct seq_file *s, void *p)
 	return 0;
 }
 
+static int reset_status_show(struct seq_file *s, void *p)
+{
+	struct ceph_fs_client *fsc = s->private;
+	struct ceph_mds_client *mdsc = fsc->mdsc;
+	struct ceph_client_reset_state *st;
+	u64 trigger = 0, success = 0, failure = 0;
+	unsigned long last_start = 0, last_finish = 0;
+	int last_errno = 0;
+	enum ceph_client_reset_phase phase = CEPH_CLIENT_RESET_IDLE;
+	bool drain_timed_out = false;
+	int sessions_reset = 0;
+	int blocked_requests = 0;
+	char reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+	if (!mdsc)
+		return 0;
+
+	st = &mdsc->reset_state;
+
+	spin_lock(&st->lock);
+	trigger = st->trigger_count;
+	success = st->success_count;
+	failure = st->failure_count;
+	last_start = st->last_start;
+	last_finish = st->last_finish;
+	last_errno = st->last_errno;
+	phase = st->phase;
+	drain_timed_out = st->drain_timed_out;
+	sessions_reset = st->sessions_reset;
+	strscpy(reason, st->last_reason, sizeof(reason));
+	spin_unlock(&st->lock);
+
+	blocked_requests = atomic_read(&st->blocked_requests);
+
+	seq_printf(s, "phase: %s\n", ceph_reset_phase_name(phase));
+	seq_printf(s, "trigger_count: %llu\n", trigger);
+	seq_printf(s, "success_count: %llu\n", success);
+	seq_printf(s, "failure_count: %llu\n", failure);
+	if (last_start)
+		seq_printf(s, "last_start_ms_ago: %u\n",
+			   jiffies_to_msecs(jiffies - last_start));
+	else
+		seq_puts(s, "last_start_ms_ago: (never)\n");
+	if (last_finish)
+		seq_printf(s, "last_finish_ms_ago: %u\n",
+			   jiffies_to_msecs(jiffies - last_finish));
+	else
+		seq_puts(s, "last_finish_ms_ago: (never)\n");
+	seq_printf(s, "last_errno: %d\n", last_errno);
+	seq_printf(s, "last_reason: %s\n",
+		   reason[0] ? reason : "(none)");
+	seq_printf(s, "drain_timed_out: %s\n",
+		   drain_timed_out ? "yes" : "no");
+	seq_printf(s, "sessions_reset: %d\n", sessions_reset);
+	seq_printf(s, "blocked_requests: %d\n", blocked_requests);
+
+	return 0;
+}
+
+static ssize_t reset_trigger_write(struct file *file, const char __user *buf,
+				   size_t len, loff_t *ppos)
+{
+	struct ceph_fs_client *fsc = file->private_data;
+	struct ceph_mds_client *mdsc = fsc->mdsc;
+	char reason[CEPH_CLIENT_RESET_REASON_LEN];
+	size_t copy;
+	int ret;
+
+	if (!mdsc)
+		return -ENODEV;
+
+	copy = min_t(size_t, len, sizeof(reason) - 1);
+	if (copy && copy_from_user(reason, buf, copy))
+		return -EFAULT;
+	reason[copy] = '\0';
+	strim(reason);
+
+	ret = ceph_mdsc_schedule_reset(mdsc, reason);
+	if (ret)
+		return ret;
+
+	return len;
+}
+
 static int subvolume_metrics_show(struct seq_file *s, void *p)
 {
 	struct ceph_fs_client *fsc = s->private;
@@ -450,6 +535,7 @@ DEFINE_SHOW_ATTRIBUTE(mdsc);
 DEFINE_SHOW_ATTRIBUTE(caps);
 DEFINE_SHOW_ATTRIBUTE(mds_sessions);
 DEFINE_SHOW_ATTRIBUTE(status);
+DEFINE_SHOW_ATTRIBUTE(reset_status);
 DEFINE_SHOW_ATTRIBUTE(metrics_file);
 DEFINE_SHOW_ATTRIBUTE(metrics_latency);
 DEFINE_SHOW_ATTRIBUTE(metrics_size);
@@ -521,6 +607,13 @@ static int metric_features_show(struct seq_file *s, void *p)
 
 DEFINE_SHOW_ATTRIBUTE(metric_features);
 
+static const struct file_operations ceph_reset_trigger_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.write = reset_trigger_write,
+	.llseek = noop_llseek,
+};
+
 /*
  * debugfs
  */
@@ -554,6 +647,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 	debugfs_remove(fsc->debugfs_caps);
 	debugfs_remove(fsc->debugfs_status);
 	debugfs_remove(fsc->debugfs_mdsc);
+	debugfs_remove_recursive(fsc->debugfs_reset_dir);
 	debugfs_remove(fsc->debugfs_subvolume_metrics);
 	debugfs_remove_recursive(fsc->debugfs_metrics_dir);
 	doutc(fsc->client, "done\n");
@@ -602,6 +696,15 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 						fsc,
 						&caps_fops);
 
+	fsc->debugfs_reset_dir = debugfs_create_dir("reset",
+						    fsc->client->debugfs_dir);
+	debugfs_create_file("trigger", 0200,
+			    fsc->debugfs_reset_dir, fsc,
+			    &ceph_reset_trigger_fops);
+	debugfs_create_file("status", 0400,
+			    fsc->debugfs_reset_dir, fsc,
+			    &reset_status_fops);
+
 	fsc->debugfs_status = debugfs_create_file("status",
 						  0400,
 						  fsc->client->debugfs_dir,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ce773b1095da..b16638ebff7f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5324,6 +5324,7 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 	blocked_count = atomic_inc_return(&st->blocked_requests);
 	doutc(cl, "request blocked during reset, %d total blocked\n",
 	      blocked_count);
+	trace_ceph_client_reset_blocked(mdsc, blocked_count);
 
 retry:
 	remaining = max_t(long, deadline - jiffies, 1);
@@ -5334,10 +5335,12 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 	if (wait_ret == 0) {
 		atomic_dec(&st->blocked_requests);
 		pr_warn_client(cl, "timed out waiting for reset to complete\n");
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	if (wait_ret < 0) {
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, (int)wait_ret);
 		return (int)wait_ret; /* -ERESTARTSYS */
 	}
 
@@ -5352,12 +5355,14 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 		if (time_before(jiffies, deadline))
 			goto retry;
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	ret = st->last_errno;
 	spin_unlock(&st->lock);
 
 	atomic_dec(&st->blocked_requests);
+	trace_ceph_client_reset_unblocked(mdsc, ret);
 	return ret ? -EAGAIN : 0;
 }
 
@@ -5387,6 +5392,7 @@ static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
 	/* Wake up all requests that were blocked waiting for reset */
 	wake_up_all(&st->blocked_wq);
 
+	trace_ceph_client_reset_complete(mdsc, ret);
 }
 
 static void ceph_mdsc_reset_workfn(struct work_struct *work)
@@ -5749,6 +5755,7 @@ int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
 	pr_info_client(mdsc->fsc->client,
 		       "manual session reset scheduled (reason=\"%s\")\n",
 		       msg);
+	trace_ceph_client_reset_schedule(mdsc, msg);
 	return 0;
 }
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a4993644d543..1d6aab060780 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -179,6 +179,7 @@ struct ceph_fs_client {
 	struct dentry *debugfs_status;
 	struct dentry *debugfs_mds_sessions;
 	struct dentry *debugfs_metrics_dir;
+	struct dentry *debugfs_reset_dir;
 	struct dentry *debugfs_subvolume_metrics;
 #endif
 
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 08cb0659fbfc..1b990632f62b 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -226,6 +226,73 @@ TRACE_EVENT(ceph_handle_caps,
 		  __entry->mseq)
 );
 
+/*
+ * Client reset tracepoints - identify the client by its monitor-
+ * assigned global_id so traces remain meaningful when kernel pointer
+ * hashing is enabled.
+ */
+TRACE_EVENT(ceph_client_reset_schedule,
+	TP_PROTO(const struct ceph_mds_client *mdsc, const char *reason),
+	TP_ARGS(mdsc, reason),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__string(reason, reason ? reason : "")
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__assign_str(reason);
+	),
+	TP_printk("client_id=%llu reason=%s",
+		  __entry->client_id, __get_str(reason))
+);
+
+TRACE_EVENT(ceph_client_reset_complete,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__entry->ret = ret;
+	),
+	TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
+TRACE_EVENT(ceph_client_reset_blocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int blocked_count),
+	TP_ARGS(mdsc, blocked_count),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, blocked_count)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__entry->blocked_count = blocked_count;
+	),
+	TP_printk("client_id=%llu blocked_count=%d", __entry->client_id,
+		  __entry->blocked_count)
+);
+
+TRACE_EVENT(ceph_client_reset_unblocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__entry->ret = ret;
+	),
+	TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
 #undef EM
 #undef E_
 #endif /* _TRACE_CEPH_H */
-- 
2.34.1

From nobody Sat Jun 13 13:34:47 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v4 07/11] selftests: ceph: add reset consistency checker
Date: Thu, 7 May 2026 12:27:33 +0000
Message-Id: <20260507122737.2804094-8-amarkuze@redhat.com>
In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>

Add a Python post-run validator for the CephFS client reset stress
test. The script reads data files written by the stress runner and
checks that every file was either written completely or is missing,
with no partial or corrupted content.

This is a prerequisite for the stress test script which invokes it
after each run.
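The "complete or missing, never partial" property described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: classify() and the three-way "intact"/"corrupt"/"missing" result are hypothetical names for this example, the temporary files stand in for the stress runner's data files, and the chunked SHA-256 helper mirrors the approach of sha256_file() in the validator without being a copy of it.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(path, expected_digest):
    """Return "missing", "intact" or "corrupt" for one recorded file.

    Hypothetical helper: compares the on-disk digest with the digest that
    was logged when the write completed.
    """
    if not path.exists():
        return "missing"
    return "intact" if sha256_file(path) == expected_digest else "corrupt"


with tempfile.TemporaryDirectory() as tmp:
    payload = b"x" * 4096
    expected = hashlib.sha256(payload).hexdigest()

    good = Path(tmp) / "file_00000"
    good.write_bytes(payload)
    partial = Path(tmp) / "file_00001"
    partial.write_bytes(payload[:1000])  # simulates a truncated write
    absent = Path(tmp) / "file_00002"    # never created

    results = [classify(p, expected) for p in (good, partial, absent)]
    print(results)
```

A truncated file hashes differently from the recorded digest, so a partial write is reported as corrupt rather than silently accepted.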
Signed-off-by: Alex Markuze
Reviewed-by: Viacheslav Dubeyko
Tested-by: Viacheslav Dubeyko
---
 .../filesystems/ceph/validate_consistency.py  | 297 ++++++++++++++++++
 1 file changed, 297 insertions(+)
 create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py

diff --git a/tools/testing/selftests/filesystems/ceph/validate_consistency.py b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
new file mode 100755
index 000000000000..c230a59bdb3a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import bisect
+import hashlib
+import json
+import os
+from pathlib import Path
+
+
+def sha256_file(path: Path) -> str:
+    digest = hashlib.sha256()
+    with path.open("rb") as handle:
+        while True:
+            chunk = handle.read(1 << 20)
+            if not chunk:
+                break
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def parse_io_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) != 5:
+                raise ValueError(f"io log line {line_no}: expected 5 columns, got {len(parts)}")
+            ts_ms, seq, logical_id, relpath, digest = parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "relpath": relpath,
+                    "digest": digest,
+                }
+            )
+    return records
+
+
+def parse_rename_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) == 6:
+                ts_ms, seq, logical_id, src_rel, dst_rel, rc = parts
+            elif len(parts) == 7:
+                ts_ms, worker_id, seq, logical_id, src_rel, dst_rel, rc = parts
+                _ = worker_id  # worker id is informational only
+            else:
+                raise ValueError(
+                    f"rename log line {line_no}: expected 6 or 7 columns, got {len(parts)}"
+                )
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "src_rel": src_rel,
+                    "dst_rel": dst_rel,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_reset_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) != 4:
+                raise ValueError(f"reset log line {line_no}: expected 4 columns, got {len(parts)}")
+            ts_ms, seq, reason, rc = parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "reason": reason,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_status_file(path: Path):
+    status = {}
+    if not path.exists():
+        return status
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            line = line.strip()
+            if not line or ":" not in line:
+                continue
+            key, value = line.split(":", 1)
+            status[key.strip()] = value.strip()
+    return status
+
+
+def to_int(value: str, default: int = 0):
+    try:
+        return int(value)
+    except Exception:
+        return default
+
+
+def validate_namespace(data_dir: Path, file_count: int, issues):
+    actual_locations = {}
+    actual_paths = {}
+    for logical_id in range(file_count):
+        name = f"file_{logical_id:05d}"
+        found = []
+        for subdir in ("A", "B"):
+            candidate = data_dir / subdir / name
+            if candidate.exists():
+                found.append((subdir, candidate))
+        if len(found) != 1:
+            issues.append(
+                f"namespace invariant failed for logical_id={logical_id:05d}: expected exactly one file in A/B, found {len(found)}"
+            )
+            continue
+        actual_locations[logical_id] = found[0][0]
+        actual_paths[logical_id] = found[0][1]
+    return actual_locations, actual_paths
+
+
+def validate_rename_invariant(rename_records, actual_locations, issues):
+    expected_locations = {}
+    for rec in rename_records:
+        if rec["rc"] != 0:
+            continue
+        dst = rec["dst_rel"]
+        if "/" not in dst:
+            continue
+        expected_locations[rec["logical_id"]] = dst.split("/", 1)[0]
+
+    for logical_id, expected in expected_locations.items():
+        actual = actual_locations.get(logical_id)
+        if actual is None:
+            continue
+        if actual != expected:
+            issues.append(
+                f"rename invariant failed for logical_id={logical_id:05d}: expected location={expected}, actual={actual}"
+            )
+
+
+def validate_data_invariant(io_records, actual_paths, issues):
+    expected_hash = {}
+    for rec in io_records:
+        digest = rec["digest"]
+        if not digest:
+            continue
+        expected_hash[rec["logical_id"]] = digest
+
+    for logical_id, digest in expected_hash.items():
+        path = actual_paths.get(logical_id)
+        if path is None:
+            continue
+        actual_digest = sha256_file(path)
+        if digest != actual_digest:
+            issues.append(
+                f"data invariant failed for logical_id={logical_id:05d}: expected digest={digest}, actual digest={actual_digest}"
+            )
+
+
+def validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues):
+    if not args.expect_reset:
+        return
+
+    successful_reset_times = [rec["ts_ms"] for rec in reset_records if rec["rc"] == 0]
+    if not successful_reset_times:
+        issues.append("expected reset activity but no successful reset trigger was observed")
+
+    phase = status.get("phase")
+    blocked_requests = to_int(status.get("blocked_requests", "0"), default=-1)
+    last_errno = to_int(status.get("last_errno", "0"), default=1)
+    failure_count = to_int(status.get("failure_count", "0"), default=-1)
+
+    if phase is None:
+        issues.append("missing final reset status file or phase field")
+    elif phase.lower() != "idle":
+        issues.append(f"recovery invariant failed: phase={phase}, expected idle")
+
+    if blocked_requests != 0:
+        issues.append(f"recovery invariant failed: blocked_requests={blocked_requests}, expected 0")
+    if last_errno != 0:
+        issues.append(f"recovery invariant failed: last_errno={last_errno}, expected 0")
+    if failure_count > 0:
+        issues.append(
+            f"recovery invariant failed: failure_count={failure_count}, "
+            "one or more resets failed during the run"
+        )
+
+    op_times = [rec["ts_ms"] for rec in io_records]
+    op_times.extend(rec["ts_ms"] for rec in rename_records if rec["rc"] == 0)
+    op_times.sort()
+
+    if successful_reset_times and not op_times:
+        issues.append("recovery SLO failed: no workload completion events were recorded")
+        return
+
+    slo_ms = args.slo_seconds * 1000
+    for ts in successful_reset_times:
+        idx = bisect.bisect_left(op_times, ts)
+        if idx >= len(op_times):
+            issues.append(f"recovery SLO failed: no operation completion observed after reset at ts_ms={ts}")
+            continue
+        delta = op_times[idx] - ts
+        if delta > slo_ms:
+            issues.append(
+                f"recovery SLO failed: first post-reset completion at {delta}ms exceeds threshold {slo_ms}ms (reset ts_ms={ts})"
+            )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Validate Ceph reset stress artifacts")
+    parser.add_argument("--data-dir", required=True)
+    parser.add_argument("--file-count", required=True, type=int)
+    parser.add_argument("--io-log", required=True)
+    parser.add_argument("--rename-log", required=True)
+    parser.add_argument("--reset-log", required=True)
+    parser.add_argument("--status-final", required=False, default="")
+    parser.add_argument("--slo-seconds", required=False, type=int, default=30)
+    parser.add_argument("--expect-reset", action="store_true")
+    parser.add_argument("--report-json", required=False, default="")
+    args = parser.parse_args()
+
+    data_dir = Path(args.data_dir)
+    io_log = Path(args.io_log)
+    rename_log = Path(args.rename_log)
+    reset_log = Path(args.reset_log)
+    status_final = Path(args.status_final) if args.status_final else Path("__missing_status__")
+
+    issues = []
+
+    if not data_dir.exists():
+        issues.append(f"data directory is missing: {data_dir}")
+
+    try:
+        io_records = parse_io_log(io_log)
+        rename_records = parse_rename_log(rename_log)
+        reset_records = parse_reset_log(reset_log)
+    except Exception as exc:
+        issues.append(f"log parsing failed: {exc}")
+        io_records = []
+        rename_records = []
+        reset_records = []
+
+    status = parse_status_file(status_final)
+
+    actual_locations, actual_paths = validate_namespace(data_dir, args.file_count, issues)
+    validate_rename_invariant(rename_records, actual_locations, issues)
+    validate_data_invariant(io_records, actual_paths, issues)
+    validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues)
+
+    report = {
+        "file_count": args.file_count,
+        "io_records": len(io_records),
+        "rename_records": len(rename_records),
+        "reset_records": len(reset_records),
+        "expect_reset": args.expect_reset,
+        "issues": issues,
+    }
+
+    if args.report_json:
+        report_path = Path(args.report_json)
+        report_path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
+
+    if issues:
+        print("FAIL: consistency validation found issues")
+        for issue in issues:
+            print(f" - {issue}")
+        raise SystemExit(1)
+
+    print("PASS: consistency validation succeeded")
+
+
+if __name__ == "__main__":
+    main()
-- 
2.34.1

From nobody Sat Jun 13 13:34:47 2026
wm6uO+egq61AzBFd8coMLVHGLBZXkcv/tNdKN7F+vb4meruLJ9hEjj35JECmNgFCpEw== X-Gm-Gg: AeBDietiiApCHLNrOi4FGxcvWoXvzZAqgkbHtqht4YNHV4vDXVv1GntsD833/4IC9F6 GCNrRsDHQvzVwDnyo7nLbuNRm2iOjvNpo8wvZWS/bHAQKLMHMsQ+te/b/xlxANQu2desxxxHMY7 UxF1w+SwgVYPAlh0L/pQMvtA7lubSGkgCgpgJ/C2qtIIqhPnq1nbp8cJ4XtP6mk8eLclp4RbpGV nfS5sMcfO/pbl7hYELDxBFnj4S0i6ZN64c3L/Vb48xsHeA89JLkXbcznpGFF/sH0rY2I6Y1DeDS PfvOq5IONCjNt2KwoiIH19fl/YFmbgooNMeITrejZLjYtg8+P31MZBKMqHVSf6ux5yUEZ7aciDJ eCaG1Ibfeedhbyl6XBBlY8BuetXTGI2vd5HvvOvpOp9CSQ/EIcvRyIRqtMuxh1Yjw09G+Gva+Tb B/ X-Received: by 2002:a17:907:26c4:b0:baf:19a8:b44b with SMTP id a640c23a62f3a-bc56cd38ef7mr474043866b.25.1778156875713; Thu, 07 May 2026 05:27:55 -0700 (PDT) X-Received: by 2002:a17:907:26c4:b0:baf:19a8:b44b with SMTP id a640c23a62f3a-bc56cd38ef7mr474040466b.25.1778156874932; Thu, 07 May 2026 05:27:54 -0700 (PDT) Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com. [13.121.85.79]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-bc81cd34ce8sm76552566b.9.2026.05.07.05.27.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 07 May 2026 05:27:54 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v4 08/11] selftests: ceph: add reset stress test Date: Thu, 7 May 2026 12:27:34 +0000 Message-Id: <20260507122737.2804094-9-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com> References: <20260507122737.2804094-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add a single-client stress test for the CephFS manual session reset feature. The test runs concurrent I/O workers alongside periodic reset injection, then validates data integrity via validate_consistency.py. 
Supports four profiles (baseline, moderate, aggressive, soak) with configurable duration, reset interval, and worker counts. Signed-off-by: Alex Markuze Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko --- .../filesystems/ceph/reset_stress.sh | 694 ++++++++++++++++++ 1 file changed, 694 insertions(+) create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh diff --git a/tools/testing/selftests/filesystems/ceph/reset_stress.sh b/too= ls/testing/selftests/filesystems/ceph/reset_stress.sh new file mode 100755 index 000000000000..c503c75a5f7a --- /dev/null +++ b/tools/testing/selftests/filesystems/ceph/reset_stress.sh @@ -0,0 +1,694 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# CephFS reset stress test: +# - Runs concurrent I/O and rename workloads +# - Triggers random client resets through debugfs +# - Validates consistency and recovery behavior + +set -euo pipefail + +KSFT_SKIP=3D4 +SCRIPT_DIR=3D"$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# kselftest auto-detect: when invoked with no arguments (e.g. by +# "make run_tests"), find a CephFS mount automatically or skip. 
+if [[ $# -eq 0 ]]; then + MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)" + if [[ -z "$MOUNT_POINT" ]]; then + echo "SKIP: No CephFS mount found and --mount-point not specified" + exit "$KSFT_SKIP" + fi + exec "$0" --mount-point "$MOUNT_POINT" +fi + +PROFILE=3D"moderate" +DURATION_SEC=3D"" +COOLDOWN_SEC=3D20 +FILE_COUNT=3D64 +IO_WORKERS=3D"" +RENAME_WORKERS=3D"" +MOUNT_POINT=3D"" +OUT_DIR=3D"" +CLIENT_ID=3D"" +DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph" +SLO_SECONDS=3D30 +EXPECT_RESET=3D1 +DMESG_CMD=3D"" +SUDO=3D"" + +RESET_MIN_SEC=3D5 +RESET_MAX_SEC=3D15 + +RUN_ID=3D"$(date +%Y%m%d-%H%M%S)" +WORKLOAD_FLAG=3D"" +RESET_FLAG=3D"" +DATA_DIR=3D"" + +IO_LOG=3D"" +RENAME_LOG=3D"" +RESET_LOG=3D"" +STATUS_LOG=3D"" +STATUS_BEFORE=3D"" +STATUS_FINAL=3D"" +DMESG_LOG=3D"" +SUMMARY_LOG=3D"" +REPORT_JSON=3D"" + +RESET_PID=3D0 +STATUS_PID=3D0 +declare -a IO_WORKER_PIDS=3D() +declare -a RENAME_WORKER_PIDS=3D() + +usage() +{ + cat <<EOF +Usage: $0 --mount-point PATH [options] + +Required: + --mount-point PATH CephFS mount point to test under + +Options: + --profile NAME baseline|moderate|aggressive|soak (default: mod= erate) + --duration-sec N Override profile runtime in seconds + --cooldown-sec N Workload drain time after injector stop (defaul= t: 20) + --file-count N Number of logical files (default: 64) + --io-workers N Number of concurrent I/O workers (profile defau= lt) + --rename-workers N Number of concurrent rename workers (profile de= fault) + --out-dir PATH Artifact directory (default: /tmp/ceph_reset_st= ress_) + --client-id ID Ceph debugfs client id; auto-detect if one clie= nt exists + --debugfs-root PATH Debugfs Ceph root (default: /sys/kernel/debug/c= eph) + --slo-seconds N Max allowed post-reset stall window (default: 3= 0) + --no-reset Disable reset injector (baseline mode helper) + --help Show this message + +Examples: + $0 --mount-point /mnt/cephfs --profile moderate + $0 --mount-point /mnt/cephfs --profile aggressive --duration-sec 300 + $0 --mount-point /mnt/cephfs --profile
baseline --no-reset +EOF +} + +now_ms() +{ + date +%s%3N +} + +set_profile_defaults() +{ + case "$PROFILE" in + baseline) + RESET_MIN_SEC=3D0 + RESET_MAX_SEC=3D0 + EXPECT_RESET=3D0 + : "${DURATION_SEC:=3D600}" + : "${IO_WORKERS:=3D1}" + : "${RENAME_WORKERS:=3D1}" + ;; + moderate) + RESET_MIN_SEC=3D5 + RESET_MAX_SEC=3D15 + : "${DURATION_SEC:=3D900}" + : "${IO_WORKERS:=3D2}" + : "${RENAME_WORKERS:=3D1}" + ;; + aggressive) + RESET_MIN_SEC=3D1 + RESET_MAX_SEC=3D5 + : "${DURATION_SEC:=3D900}" + : "${IO_WORKERS:=3D4}" + : "${RENAME_WORKERS:=3D2}" + ;; + soak) + RESET_MIN_SEC=3D5 + RESET_MAX_SEC=3D15 + : "${DURATION_SEC:=3D3600}" + : "${IO_WORKERS:=3D2}" + : "${RENAME_WORKERS:=3D1}" + ;; + *) + echo "Unknown profile: $PROFILE" >&2 + exit 2 + ;; + esac +} + +log_summary() +{ + local msg=3D"$1" + printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$msg" | tee -a "$SUM= MARY_LOG" +} + +discover_client_id() +{ + local candidates=3D() + local entry + + if [[ -n "$CLIENT_ID" ]]; then + if ! $SUDO test -d "$DEBUGFS_ROOT/$CLIENT_ID/reset"; then + echo "SKIP: reset debugfs not found for client-id=3D$CLIENT_ID" >&2 + exit "$KSFT_SKIP" + fi + return 0 + fi + + if ! $SUDO test -d "$DEBUGFS_ROOT"; then + echo "SKIP: Debugfs root not found: $DEBUGFS_ROOT" >&2 + exit "$KSFT_SKIP" + fi + + while IFS=3D read -r entry; do + $SUDO test -d "$DEBUGFS_ROOT/$entry/reset" || continue + $SUDO test -w "$DEBUGFS_ROOT/$entry/reset/trigger" || continue + candidates+=3D("$entry") + done < <($SUDO ls -1 "$DEBUGFS_ROOT" 2>/dev/null || true) + + if [[ ${#candidates[@]} -eq 1 ]]; then + CLIENT_ID=3D"${candidates[0]}" + return 0 + fi + + if [[ ${#candidates[@]} -eq 0 ]]; then + echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" = >&2 + exit "$KSFT_SKIP" + fi + + echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-= id." 
>&2 + exit "$KSFT_SKIP" +} + +init_dataset() +{ + local i + mkdir -p "$DATA_DIR/A" "$DATA_DIR/B" + + for ((i =3D 0; i < FILE_COUNT; i++)); do + printf 'seed logical_id=3D%05d ts_ms=3D%s\n' "$i" "$(now_ms)" > "$DATA_D= IR/A/file_$(printf '%05d' "$i")" + done +} + +io_worker() +{ + set +e + local worker_id=3D"$1" + local seq=3D0 + local id + local relpath + local abspath + local payload + local hash + local ts + + while [[ -f "$WORKLOAD_FLAG" ]]; do + id=3D"$(printf '%05d' $((RANDOM % FILE_COUNT)))" + if [[ -f "$DATA_DIR/A/file_$id" ]]; then + relpath=3D"A/file_$id" + elif [[ -f "$DATA_DIR/B/file_$id" ]]; then + relpath=3D"B/file_$id" + else + sleep 0.02 + continue + fi + + abspath=3D"$DATA_DIR/$relpath" + alt_relpath=3D"" + if [[ "$relpath" =3D=3D A/* ]]; then + alt_relpath=3D"B/file_$id" + else + alt_relpath=3D"A/file_$id" + fi + alt_abspath=3D"$DATA_DIR/$alt_relpath" + payload=3D"worker=3D${worker_id} io_seq=3D${seq} id=3D${id} ts_ms=3D$(no= w_ms)" + result=3D"$( + python3 - "$abspath" "$alt_abspath" "$payload" <<'PY' +import hashlib +import os +import sys + +path =3D sys.argv[1] +alt_path =3D sys.argv[2] +payload =3D sys.argv[3] + +try: + fd =3D os.open(path, os.O_RDWR | os.O_APPEND) + actual =3D path +except FileNotFoundError: + try: + fd =3D os.open(alt_path, os.O_RDWR | os.O_APPEND) + actual =3D alt_path + except FileNotFoundError: + sys.exit(1) + +try: + os.write(fd, (payload + "\n").encode()) + os.fsync(fd) + os.lseek(fd, 0, os.SEEK_SET) + digest =3D hashlib.sha256() + while True: + chunk =3D os.read(fd, 1 << 20) + if not chunk: + break + digest.update(chunk) + print(actual + " " + digest.hexdigest()) +finally: + os.close(fd) +PY + )" || { + sleep 0.02 + continue + } + + actual_abspath=3D"${result%% *}" + hash=3D"${result#* }" + if [[ "$actual_abspath" =3D=3D "$alt_abspath" ]]; then + relpath=3D"$alt_relpath" + fi + + ts=3D"$(now_ms)" + printf '%s,%s,%s,%s,%s\n' "$ts" "$seq" "$id" "$relpath" "$hash" >> "$IO_= LOG" + seq=3D$((seq + 1)) + sleep 0.02 + done +} 
+ +rename_worker() +{ + set +e + local worker_id=3D"$1" + local seq=3D0 + local id + local src_rel + local dst_rel + local rc + local ts + + while [[ -f "$WORKLOAD_FLAG" ]]; do + id=3D"$(printf '%05d' $((RANDOM % FILE_COUNT)))" + + if [[ -f "$DATA_DIR/A/file_$id" ]]; then + src_rel=3D"A/file_$id" + dst_rel=3D"B/file_$id" + elif [[ -f "$DATA_DIR/B/file_$id" ]]; then + src_rel=3D"B/file_$id" + dst_rel=3D"A/file_$id" + else + sleep 0.02 + continue + fi + + ts=3D"$(now_ms)" + if mv -T "$DATA_DIR/$src_rel" "$DATA_DIR/$dst_rel" 2>/dev/null; then + rc=3D0 + else + rc=3D$? + fi + printf '%s,%s,%s,%s,%s,%s,%s\n' "$ts" "$worker_id" "$seq" "$id" "$src_re= l" "$dst_rel" "$rc" >> "$RENAME_LOG" + seq=3D$((seq + 1)) + sleep 0.02 + done +} + +random_sleep_seconds() +{ + local min_sec=3D"$1" + local max_sec=3D"$2" + local wait_sec + local span + + span=3D$((max_sec - min_sec + 1)) + wait_sec=3D$((min_sec + RANDOM % span)) + sleep "$wait_sec" +} + +reset_injector() +{ + set +e + local trigger_path=3D"$1" + local seq=3D0 + local ts + local reason + local rc + + while [[ -f "$RESET_FLAG" ]]; do + random_sleep_seconds "$RESET_MIN_SEC" "$RESET_MAX_SEC" + [[ -f "$RESET_FLAG" ]] || break + + ts=3D"$(now_ms)" + reason=3D"stress_${seq}_${ts}" + if echo "$reason" | $SUDO tee "$trigger_path" > /dev/null 2>&1; then + rc=3D0 + else + rc=3D$? 
+ fi + printf '%s,%s,%s,%s\n' "$ts" "$seq" "$reason" "$rc" >> "$RESET_LOG" + seq=3D$((seq + 1)) + done +} + +status_sampler() +{ + set +e + local status_path=3D"$1" + local ts + local kv_line + + while [[ -f "$WORKLOAD_FLAG" || -f "$RESET_FLAG" ]]; do + ts=3D"$(now_ms)" + if $SUDO test -r "$status_path"; then + kv_line=3D"$($SUDO awk -F': ' 'NF>=3D2 {gsub(/ /, "", $1); gsub(/ /, ""= , $2); printf "%s=3D%s;", $1, $2}' "$status_path")" + printf '%s,%s\n' "$ts" "$kv_line" >> "$STATUS_LOG" + fi + sleep 1 + done +} + +stop_pid_with_timeout() +{ + local pid=3D"$1" + local name=3D"$2" + local timeout=3D"$3" + local waited=3D0 + + if [[ "$pid" -le 0 ]]; then + return 0 + fi + + while kill -0 "$pid" 2>/dev/null; do + if (( waited >=3D timeout )); then + log_summary "Timeout waiting for $name (pid=3D$pid), sending SIGTERM/SI= GKILL" + kill -TERM "$pid" 2>/dev/null || true + sleep 1 + kill -KILL "$pid" 2>/dev/null || true + wait "$pid" 2>/dev/null || true + return 1 + fi + sleep 1 + waited=3D$((waited + 1)) + done + + wait "$pid" 2>/dev/null || true + return 0 +} + +detect_privileges() +{ + if [[ -r "$DEBUGFS_ROOT" ]]; then + SUDO=3D"" + elif sudo -n true 2>/dev/null; then + SUDO=3D"sudo" + else + echo "WARNING: $DEBUGFS_ROOT is not readable and passwordless sudo is no= t available" >&2 + echo "WARNING: reset injection, debugfs status checks, and dmesg capture= will not work" >&2 + fi + + if $SUDO dmesg > /dev/null 2>&1; then + DMESG_CMD=3D"$SUDO dmesg" + else + DMESG_CMD=3D"" + echo "WARNING: dmesg is not accessible; kernel errors (hung tasks) will = not be detected" >&2 + fi +} + +check_dmesg() +{ + local start_epoch=3D"$1" + + if [[ -z "$DMESG_CMD" ]]; then + return 0 + fi + + if ! $DMESG_CMD --since "@$start_epoch" > "$DMESG_LOG" 2>/dev/null; then + if ! 
$DMESG_CMD > "$DMESG_LOG" 2>/dev/null; then + log_summary "WARNING: dmesg capture failed unexpectedly" + return 0 + fi + log_summary "dmesg --since unsupported; captured full dmesg" + fi + + if grep -qi "hung task" "$DMESG_LOG" 2>/dev/null; then + log_summary "ERROR: kernel log contains 'hung task' during test window" + return 1 + fi + + return 0 +} + +cleanup() +{ + rm -f "$WORKLOAD_FLAG" "$RESET_FLAG" + local pid + for pid in "${IO_WORKER_PIDS[@]}" "${RENAME_WORKER_PIDS[@]}" "$RESET_PID"= "$STATUS_PID"; do + [[ "$pid" -gt 0 ]] 2>/dev/null && kill "$pid" 2>/dev/null || true + done + wait 2>/dev/null || true +} + +parse_args() +{ + while [[ $# -gt 0 ]]; do + case "$1" in + --mount-point) + MOUNT_POINT=3D"$2" + shift 2 + ;; + --profile) + PROFILE=3D"$2" + shift 2 + ;; + --duration-sec) + DURATION_SEC=3D"$2" + shift 2 + ;; + --cooldown-sec) + COOLDOWN_SEC=3D"$2" + shift 2 + ;; + --file-count) + FILE_COUNT=3D"$2" + shift 2 + ;; + --io-workers) + IO_WORKERS=3D"$2" + shift 2 + ;; + --rename-workers) + RENAME_WORKERS=3D"$2" + shift 2 + ;; + --out-dir) + OUT_DIR=3D"$2" + shift 2 + ;; + --client-id) + CLIENT_ID=3D"$2" + shift 2 + ;; + --debugfs-root) + DEBUGFS_ROOT=3D"$2" + shift 2 + ;; + --slo-seconds) + SLO_SECONDS=3D"$2" + shift 2 + ;; + --no-reset) + EXPECT_RESET=3D0 + shift + ;; + --help|-h) + usage + exit 0 + ;; + *) + echo "Unknown option: $1" >&2 + usage + exit 2 + ;; + esac + done +} + +main() +{ + local start_epoch + local trigger_path=3D"" + local status_path=3D"" + local final_rc=3D0 + local reset_enabled=3D0 + local i + + parse_args "$@" + + if [[ -z "$MOUNT_POINT" ]]; then + echo "--mount-point is required" >&2 + usage + exit 2 + fi + + if [[ ! -d "$MOUNT_POINT" ]]; then + echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2 + exit "$KSFT_SKIP" + fi + + if ! 
touch "$MOUNT_POINT/.ceph_reset_test_probe" 2>/dev/null; then + echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2 + exit "$KSFT_SKIP" + fi + rm -f "$MOUNT_POINT/.ceph_reset_test_probe" + + if ! command -v python3 > /dev/null 2>&1; then + echo "SKIP: python3 is required but not found in PATH" >&2 + exit "$KSFT_SKIP" + fi + + if ! stat -f -c '%T' "$MOUNT_POINT" 2>/dev/null | grep -qi ceph; then + echo "WARNING: $MOUNT_POINT does not appear to be a CephFS mount" >&2 + fi + + detect_privileges + + set_profile_defaults + if [[ "$EXPECT_RESET" -eq 0 ]]; then + PROFILE=3D"baseline" + RESET_MIN_SEC=3D0 + RESET_MAX_SEC=3D0 + fi + + if ! [[ "$IO_WORKERS" =3D~ ^[0-9]+$ && "$RENAME_WORKERS" =3D~ ^[0-9]+$ ]]= ; then + echo "io-workers and rename-workers must be integers" >&2 + exit 2 + fi + + if [[ "$IO_WORKERS" -le 0 || "$RENAME_WORKERS" -le 0 ]]; then + echo "io-workers and rename-workers must be > 0" >&2 + exit 2 + fi + + if [[ -z "$OUT_DIR" ]]; then + OUT_DIR=3D"/tmp/ceph_reset_stress_${RUN_ID}" + fi + mkdir -p "$OUT_DIR" + + WORKLOAD_FLAG=3D"$OUT_DIR/workload.running" + RESET_FLAG=3D"$OUT_DIR/reset.running" + + DATA_DIR=3D"$MOUNT_POINT/ceph_reset_stress_${RUN_ID}" + mkdir -p "$DATA_DIR" + + IO_LOG=3D"$OUT_DIR/io.log" + RENAME_LOG=3D"$OUT_DIR/rename.log" + RESET_LOG=3D"$OUT_DIR/reset.log" + STATUS_LOG=3D"$OUT_DIR/status.log" + STATUS_BEFORE=3D"$OUT_DIR/reset_status.before" + STATUS_FINAL=3D"$OUT_DIR/reset_status.final" + DMESG_LOG=3D"$OUT_DIR/dmesg.log" + SUMMARY_LOG=3D"$OUT_DIR/summary.log" + REPORT_JSON=3D"$OUT_DIR/validator_report.json" + + : > "$IO_LOG" + : > "$RENAME_LOG" + : > "$RESET_LOG" + : > "$STATUS_LOG" + : > "$SUMMARY_LOG" + + start_epoch=3D"$(date +%s)" + + log_summary "Starting Ceph reset stress test" + log_summary "Profile=3D$PROFILE duration=3D${DURATION_SEC}s cooldown=3D${= COOLDOWN_SEC}s file_count=3D${FILE_COUNT} io_workers=3D${IO_WORKERS} rename= _workers=3D${RENAME_WORKERS}" + [[ -n "$SUDO" ]] && log_summary "Using sudo for privileged 
operations" + [[ -z "$DMESG_CMD" ]] && log_summary "WARNING: dmesg not available; hung = task detection disabled" + log_summary "Artifacts=3D$OUT_DIR" + log_summary "Data dir=3D$DATA_DIR" + + init_dataset + + if [[ "$EXPECT_RESET" -eq 1 ]]; then + discover_client_id + trigger_path=3D"$DEBUGFS_ROOT/$CLIENT_ID/reset/trigger" + status_path=3D"$DEBUGFS_ROOT/$CLIENT_ID/reset/status" + if ! $SUDO test -w "$trigger_path"; then + echo "SKIP: Reset trigger is not writable: $trigger_path" >&2 + exit "$KSFT_SKIP" + fi + if ! $SUDO test -r "$status_path"; then + echo "SKIP: Reset status is not readable: $status_path" >&2 + exit "$KSFT_SKIP" + fi + $SUDO cat "$status_path" > "$STATUS_BEFORE" || true + reset_enabled=3D1 + log_summary "Using ceph client id: $CLIENT_ID" + fi + + trap cleanup EXIT INT TERM + + touch "$WORKLOAD_FLAG" + for ((i =3D 0; i < IO_WORKERS; i++)); do + io_worker "$i" & + IO_WORKER_PIDS+=3D("$!") + done + + for ((i =3D 0; i < RENAME_WORKERS; i++)); do + rename_worker "$i" & + RENAME_WORKER_PIDS+=3D("$!") + done + + if [[ "$reset_enabled" -eq 1 ]]; then + touch "$RESET_FLAG" + reset_injector "$trigger_path" & + RESET_PID=3D$! + + status_sampler "$status_path" & + STATUS_PID=3D$! + fi + + sleep "$DURATION_SEC" + + if [[ "$reset_enabled" -eq 1 ]]; then + rm -f "$RESET_FLAG" + stop_pid_with_timeout "$RESET_PID" "reset_injector" 20 || final_rc=3D1 + log_summary "Injector stopped; entering cooldown=3D${COOLDOWN_SEC}s" + fi + + sleep "$COOLDOWN_SEC" + + rm -f "$WORKLOAD_FLAG" + for i in "${!IO_WORKER_PIDS[@]}"; do + stop_pid_with_timeout "${IO_WORKER_PIDS[$i]}" "io_worker[$i]" 20 || fina= l_rc=3D1 + done + for i in "${!RENAME_WORKER_PIDS[@]}"; do + stop_pid_with_timeout "${RENAME_WORKER_PIDS[$i]}" "rename_worker[$i]" 20= || final_rc=3D1 + done + + if [[ "$reset_enabled" -eq 1 ]]; then + stop_pid_with_timeout "$STATUS_PID" "status_sampler" 10 || final_rc=3D1 + $SUDO cat "$status_path" > "$STATUS_FINAL" || true + fi + + if ! 
check_dmesg "$start_epoch"; then + final_rc=3D1 + fi + + if ! python3 "$SCRIPT_DIR/validate_consistency.py" \ + --data-dir "$DATA_DIR" \ + --file-count "$FILE_COUNT" \ + --io-log "$IO_LOG" \ + --rename-log "$RENAME_LOG" \ + --reset-log "$RESET_LOG" \ + --status-final "$STATUS_FINAL" \ + --slo-seconds "$SLO_SECONDS" \ + --report-json "$REPORT_JSON" \ + $( [[ "$reset_enabled" -eq 1 ]] && echo "--expect-reset" ); then + final_rc=3D1 + fi + + if [[ "$final_rc" -eq 0 ]]; then + log_summary "PASS: stress run completed successfully" + else + log_summary "FAIL: stress run detected one or more failures" + fi + + log_summary "Artifacts available in: $OUT_DIR" + exit "$final_rc" +} + +main "$@" --=20 2.34.1 From nobody Sat Jun 13 13:34:47 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5040E3F075B for ; Thu, 7 May 2026 12:28:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778156884; cv=none; b=r/Ym8PFwCp7bq8arsxzO2m89XvybhUytUyTnQe3X2npODwqFDWJ+vsjAnlY3KJgTpfYSlPCVGcRVot8+oGE5E4g6UVU/nRs3SP/3iniw3Ulu/KdGm/ra++B3+5nNAFsCVNIDMiiHswZjjtYDXZIOcdyQlDm16SUHcfs3XXm1vlo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778156884; c=relaxed/simple; bh=Ds+NWchAxLST5EnvxjghWpSx2cCAXjiNQobyoGIaRDs=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=t/X565AMKFLiBwfgCm9If1LMCeRmt+KI6k8S27hWKRnLsS0bWUxpvpHhT3s+GACmf8hgJUTfWX8e+FruOwIo/E9DwdEbDTfMWVILs10vkz4L+wt1pBZXxaljMQLiidryK96sWug6NReTZm5kdXnNFj9pjetztSFUFeKSC2nJPzk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit 
key) header.d=redhat.com header.i=@redhat.com header.b=W1LpsHcp; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b=rhEluNJj; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="W1LpsHcp"; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b="rhEluNJj" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1778156880; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ob0h8gSVGLqT/7328u/2PbJUBEL3c1K2zmCjhJRA6LA=; b=W1LpsHcpVqiAfi3uqaJCELFgFtcTp5Gvru3ir5HI69U1uz7yLF3q/XkbfdT0J/vrhMOl/w 3hSYKz5ZR/wS9lfF9kQRGgER9gaXusKhXmHyf9Is02wCmTGCFcJoanrFLrGXJwryu3y3/j O68B8w/AXN0X9LvLTB+utnbJTjOKAJw= Received: from mail-ed1-f72.google.com (mail-ed1-f72.google.com [209.85.208.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-393-ek6hmC1gOEWVg-6UJyVAHw-1; Thu, 07 May 2026 08:27:58 -0400 X-MC-Unique: ek6hmC1gOEWVg-6UJyVAHw-1 X-Mimecast-MFC-AGG-ID: ek6hmC1gOEWVg-6UJyVAHw_1778156877 Received: by mail-ed1-f72.google.com with SMTP id 4fb4d7f45d1cf-6794b459297so652368a12.1 for ; Thu, 07 May 2026 05:27:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=google; t=1778156877; x=1778761677; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=ob0h8gSVGLqT/7328u/2PbJUBEL3c1K2zmCjhJRA6LA=; 
b=rhEluNJjpSsi3MFJnGOyKGZB/ch1fGk4+nGA8nvVlR9tHRWBZUMDlDB9p6RQMhRnD0 luVsDxqTTeJEc85c9lASOMriVYgpzikhIzk5QrvUpCj2VaY5INUH40W5U4D0PfXLX300 umM9b6gh10G8u0PYjLATbj7JgRLRcc5Gf5v4siCyA/9BLzro5TN1uiBr36+7RRf5/B7P ksiLflHLZETKjoEbv9KOzP81P1z/ErMAQ1s7BztGETbL/TEkz2pxqalp/i91rwKEJ8OK fNOFQU2D3TXFAWt6ZXqlZeX5/beQ7HaA4Y/3t8b5blDaGDarDfEpzC0CXV48WmVNEb0+ kQng== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1778156877; x=1778761677; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=ob0h8gSVGLqT/7328u/2PbJUBEL3c1K2zmCjhJRA6LA=; b=afqxYzjVNPw2cTKeOBK5gHD1jcDDThN62NbFI8qyDkbHKYqY8rdgesyBFUbgw8lFnQ 79jmR2681OAuccMTkotVuqk7xLe2/GagbFiNzdq5UbD8xF3HROIvoQ/Ns3crb6l0MzJy oi7X5Yat2Z2v38U8gK2hGWZ8slB/oScVBNSWgVfHlMTYrowmSUHfHkccu2s+qKnuO5uG hGdm+8Cl+0fwtBSQvRzimBa6wjRJZf7fxc6gv0ieQahucczKPkQlrZAJ2tnACYrA7gyl Wpz/rsTaxlBt35gI0putyOK+wJC1RMmMykCBkpojdJYkdEzg3RiMvdEGuNaWd4N7hvEH Adkg== X-Gm-Message-State: AOJu0Yw/7lSk55E2K/1yy2JiIpCCa6vbfW+TWwkC0ds/qeYCc5QoY6mb FxPsmavfroRJkv11EwqrSf7sCwnNTdWIppulIF3GoUVLVO+N8nTy3rcrhYuPmuV3BshHNhoWhf3 41LsOUj5UJAseGVdn1iFkQW98p8U3xYaLqZcEpxfvrToP2l8pqK2o9yqNh0c97Gg53w== X-Gm-Gg: AeBDieuW+eNHeyP6quIsoEoA2GyoITv2+4kVz7HufQzTNApA6HtVTs9jxyve1TYfcpD k5xaZV0dQhQcWd9mWUPZZPMXA5PF3+sK8OFmHPhvmfD0uZwsg05T9fAUTKgyFjEi+c85up8a/Re q1H+MOrry2GS8ILsYvLQQaQu61MLkXsSPfZKweUPbGcaoS+eccQ3uxxFu8q+NDS8M08OWdBBKIt pvd0P2qv1R1pbU/PTkoKC7Wm5UrxnVwR+qkG+rxznvmwRvNpCBAVflOz92JM8j/wpZtvTiwgzFL +uaedWJKWtKUZPHno9F4P/118U1qAmSiQv5MQRruOvwoRSDgIaFB+XuMZnPprZazomBYms+RgaJ OW8kTJaofWQYcATwEMUOj1KmfRo0prjd69DuqqX9OeoYyWjY+/qhqgFpUa3bIMA8T1sGakMDNi/ n6 X-Received: by 2002:a17:907:e143:b0:bc6:41f0:9b2f with SMTP id a640c23a62f3a-bc641f09bf6mr218282066b.15.1778156876816; Thu, 07 May 2026 05:27:56 -0700 (PDT) X-Received: by 2002:a17:907:e143:b0:bc6:41f0:9b2f with SMTP id 
a640c23a62f3a-bc641f09bf6mr218279366b.15.1778156875946; Thu, 07 May 2026 05:27:55 -0700 (PDT) Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com. [13.121.85.79]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-bc81cd34ce8sm76552566b.9.2026.05.07.05.27.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 07 May 2026 05:27:55 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v4 09/11] selftests: ceph: add reset corner-case tests Date: Thu, 7 May 2026 12:27:35 +0000 Message-Id: <20260507122737.2804094-10-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com> References: <20260507122737.2804094-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add targeted corner-case tests for the CephFS manual session reset feature. Four sequential tests cover: [1/4] ebusy_rejection - second reset rejected while first in-flight [2/4] dirty_caps_at_reset - reset with unflushed dirty caps [3/4] flock_after_reset - stale lock EIO + fresh lock after holder ex= it [4/4] unmount_during_reset - umount during active reset (ESHUTDOWN path) Requires: mounted CephFS, debugfs access (root), flock(1) utility. 
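
All four tests rely on the same pattern: read "key: value" fields from the reset status file and poll until the phase returns to idle. The sketch below illustrates that pattern against a mock file, so it can be run without a CephFS mount. It is not part of the patch. The temporary file, the sample field values ("draining", the counters) and the poll-count/interval parameters are assumptions made for illustration; on a real system the file is the reset/status node under the client's debugfs directory, and its exact contents are defined by the kernel side of this series.

```shell
#!/bin/bash
# Illustrative sketch only: a mock file stands in for the debugfs
# reset/status node. Field values below are assumed, not kernel output.
set -u

STATUS_PATH="$(mktemp)"

# Atomically replace the mock status file with new "key: value" lines.
write_status()
{
    printf '%s\n' "$@" > "$STATUS_PATH.new"
    mv "$STATUS_PATH.new" "$STATUS_PATH"
}

# Print the value of a single "key: value" field from the status file.
read_status_field()
{
    awk -F': ' -v key="$1" '$1 == key {print $2}' "$STATUS_PATH" 2>/dev/null
}

# Poll until phase is "idle"; give up after max_polls checks.
wait_reset_done()
{
    local max_polls="${1:-30}"
    local interval="${2:-1}"
    local polls=0

    while [[ "$(read_status_field phase)" != "idle" ]]; do
        polls=$((polls + 1))
        if [[ "$polls" -ge "$max_polls" ]]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Start with a reset "in flight", then let it finish in the background.
write_status "phase: draining" "trigger_count: 1" "success_count: 0"
( sleep 0.3; write_status "phase: idle" "trigger_count: 1" "success_count: 1" ) &

if wait_reset_done 50 0.1; then
    echo "reset done, success_count=$(read_status_field success_count)"
else
    echo "timed out waiting for idle"
fi

wait
rm -f "$STATUS_PATH" "$STATUS_PATH.new"
```

Comparing counters before and after the wait, as the tests do with trigger_count and success_count, is what distinguishes "the reset ran once" from "the reset was rejected or never started".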
Signed-off-by: Alex Markuze Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko --- .../filesystems/ceph/reset_corner_cases.sh | 646 ++++++++++++++++++ 1 file changed, 646 insertions(+) create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_c= ases.sh diff --git a/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh= b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh new file mode 100755 index 000000000000..a6dae84a616d --- /dev/null +++ b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh @@ -0,0 +1,646 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# CephFS client reset corner case tests. +# Runs a checklist of targeted tests that exercise specific reset +# code paths not covered by the stress tests. +# +# Requires: mounted CephFS, debugfs access (root), flock(1) utility. + +set -uo pipefail + +KSFT_SKIP=3D4 + +# kselftest auto-detect: when invoked with no arguments (e.g. by +# "make run_tests"), find a CephFS mount automatically or skip. 
+if [[ $# -eq 0 ]]; then + MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)" + if [[ -z "$MOUNT_POINT" ]]; then + echo "SKIP: No CephFS mount found and --mount-point not specified" + exit "$KSFT_SKIP" + fi + exec "$0" --mount-point "$MOUNT_POINT" +fi + +MOUNT_POINT=3D"" +DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph" +DEBUGFS_CLIENT=3D"" +TRIGGER_PATH=3D"" +STATUS_PATH=3D"" +TEMP_MNT=3D"" + +PASS_COUNT=3D0 +FAIL_COUNT=3D0 +SKIP_COUNT=3D0 +TOTAL=3D4 + +log() +{ + printf '[%s] %s\n' "$(date -u +%H:%M:%S)" "$1" +} + +result() +{ + local num=3D"$1" + local name=3D"$2" + local status=3D"$3" + local detail=3D"${4:-}" + + case "$status" in + PASS) PASS_COUNT=3D$((PASS_COUNT + 1)) ;; + FAIL) FAIL_COUNT=3D$((FAIL_COUNT + 1)) ;; + SKIP) SKIP_COUNT=3D$((SKIP_COUNT + 1)) ;; + esac + + if [[ -n "$detail" ]]; then + printf '[%d/%d] %-30s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$de= tail" + else + printf '[%d/%d] %-30s %s\n' "$num" "$TOTAL" "$name" "$status" + fi +} + +read_status_field() +{ + local field=3D"$1" + awk -F': ' -v key=3D"$field" '$1 =3D=3D key {print $2}' "$STATUS_PATH" 2>= /dev/null +} + +wait_reset_done() +{ + local timeout=3D"${1:-30}" + local elapsed=3D0 + + while [[ "$(read_status_field "phase")" !=3D "idle" ]]; do + sleep 1 + elapsed=3D$((elapsed + 1)) + if [[ "$elapsed" -ge "$timeout" ]]; then + return 1 + fi + done + return 0 +} + +list_reset_clients() +{ + local entry + + for entry in "$DEBUGFS_ROOT"/*/; do + entry=3D"$(basename "$entry")" + [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue + [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue + printf '%s\n' "$entry" + done +} + +wait_status_nonidle() +{ + local status_path=3D"$1" + local timeout=3D"${2:-10}" + local polls=3D$((timeout * 10)) + local phase + + while [[ "$polls" -gt 0 ]]; do + phase=3D"$(awk -F': ' '$1 =3D=3D "phase" {print $2}' "$status_path" 2>/d= ev/null)" + if [[ -n "$phase" && "$phase" !=3D "idle" ]]; then + return 0 + fi + sleep 0.1 + polls=3D$((polls - 1)) + 
done + + return 1 +} + +discover_debugfs() +{ + local candidates=3D() + local entry + + if [[ -n "$DEBUGFS_CLIENT" ]]; then + if [[ ! -d "$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset" ]]; then + echo "SKIP: reset debugfs not found for $DEBUGFS_CLIENT" >&2 + exit "$KSFT_SKIP" + fi + return 0 + fi + + for entry in "$DEBUGFS_ROOT"/*/; do + entry=3D"$(basename "$entry")" + [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue + [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue + candidates+=3D("$entry") + done + + if [[ ${#candidates[@]} -eq 0 ]]; then + echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" = >&2 + exit "$KSFT_SKIP" + fi + + if [[ ${#candidates[@]} -gt 1 ]]; then + echo "SKIP: Multiple Ceph clients found: ${candidates[*]}. Use --client-= id." >&2 + exit "$KSFT_SKIP" + fi + + DEBUGFS_CLIENT=3D"${candidates[0]}" +} + +# --- Test 1: ebusy_rejection --------------------------------------------= ---- +# +# Trigger a reset while another is guaranteed in-flight. Creates +# dirty state so the first reset enters DRAINING (which takes +# measurable time), then polls until phase !=3D idle and issues the +# second trigger. The second trigger must fail (the kernel returns +# -EBUSY), and only one reset must be counted in the accounting. 
+ +test_ebusy_rejection() +{ + local num=3D1 + local name=3D"ebusy_rejection" + local testfile=3D"$MOUNT_POINT/.reset_corner_ebusy_$$" + local tc_before tc_after sc_before sc_after second_rc phase elapsed + + tc_before=3D"$(read_status_field "trigger_count")" + sc_before=3D"$(read_status_field "success_count")" + + # Create dirty state so the first reset enters DRAINING + echo "ebusy_dirty_data" > "$testfile" + sync "$testfile" + + python3 -c " +import os, sys +fd =3D os.open('$testfile', os.O_WRONLY | os.O_APPEND) +os.write(fd, b'dirty_for_ebusy_test\n') +sys.stdout.write('written') +" 2>/dev/null || { + result "$num" "$name" FAIL "dirty write failed" + rm -f "$testfile" + return + } + + # Trigger the first reset -- it will drain dirty state + echo "ebusy_first" > "$TRIGGER_PATH" 2>/dev/null || { + result "$num" "$name" FAIL "first trigger failed" + rm -f "$testfile" + return + } + + # Poll until phase is non-idle (quiescing or draining) + elapsed=3D0 + while true; do + phase=3D"$(read_status_field "phase")" + if [[ "$phase" !=3D "idle" ]]; then + break + fi + sleep 0.1 + elapsed=3D$((elapsed + 1)) + if [[ "$elapsed" -ge 50 ]]; then + result "$num" "$name" SKIP \ + "first reset completed before overlap could be tested" + rm -f "$testfile" 2>/dev/null + return + fi + done + + # Issue the second trigger -- should be rejected with EBUSY + second_rc=3D0 + echo "ebusy_second" > "$TRIGGER_PATH" 2>/dev/null && second_rc=3D0 || sec= ond_rc=3D$? + + if ! 
wait_reset_done 30; then + result "$num" "$name" FAIL "first reset never completed" + rm -f "$testfile" + return + fi + + tc_after=3D"$(read_status_field "trigger_count")" + sc_after=3D"$(read_status_field "success_count")" + + if [[ "$((tc_after - tc_before))" -ne 1 ]]; then + result "$num" "$name" FAIL "trigger_count +$((tc_after - tc_before)), ex= pected +1" + rm -f "$testfile" + return + fi + + if [[ "$((sc_after - sc_before))" -ne 1 ]]; then + result "$num" "$name" FAIL "success_count +$((sc_after - sc_before)), ex= pected +1" + rm -f "$testfile" + return + fi + + if [[ "$second_rc" -eq 0 ]]; then + result "$num" "$name" FAIL "second trigger did not return error" + rm -f "$testfile" + return + fi + + rm -f "$testfile" 2>/dev/null + result "$num" "$name" PASS +} + +# --- Test 2: dirty_caps_at_reset ----------------------------------------= ---- +# +# Write to a file without fsync (dirty caps), trigger reset, then +# verify the file is not corrupt. Manual reset drains dirty caps +# before teardown (best-effort, 5s timeout). For a non-stuck cap +# the dirty write should be flushed during drain and persist. +# If the drain window is too short, only the synced first line +# persists -- that is acceptable (data loss is documented for +# unflushed writes). 
+ +test_dirty_caps_at_reset() +{ + local num=3D2 + local name=3D"dirty_caps_at_reset" + local testfile=3D"$MOUNT_POINT/.reset_corner_dirty_caps_$$" + local content_after line_count sc_before sc_after le + + sc_before=3D"$(read_status_field "success_count")" + + echo "line_1_before_dirty_write" > "$testfile" + sync "$testfile" + + python3 -c " +import os, sys +fd =3D os.open('$testfile', os.O_WRONLY | os.O_APPEND) +os.write(fd, b'line_2_dirty_no_fsync\n') +# deliberately no fsync -- leave caps dirty +sys.stdout.write('written') +" 2>/dev/null || { + result "$num" "$name" FAIL "dirty write failed" + rm -f "$testfile" + return + } + + echo "dirty_caps_test" > "$TRIGGER_PATH" 2>/dev/null || { + result "$num" "$name" FAIL "reset trigger failed" + rm -f "$testfile" + return + } + + if ! wait_reset_done 30; then + result "$num" "$name" FAIL "reset did not complete" + rm -f "$testfile" + return + fi + + sc_after=3D"$(read_status_field "success_count")" + if [[ "$sc_after" -le "$sc_before" ]]; then + result "$num" "$name" FAIL "success_count did not increment (reset not e= xercised)" + rm -f "$testfile" + return + fi + + sync "$testfile" 2>/dev/null || true + content_after=3D"$(cat "$testfile" 2>/dev/null)" || { + result "$num" "$name" FAIL "cannot read file after reset" + rm -f "$testfile" + return + } + + if [[ -z "$content_after" ]]; then + result "$num" "$name" FAIL "file is empty after reset" + rm -f "$testfile" + return + fi + + line_count=3D"$(echo "$content_after" | wc -l)" + if [[ "$line_count" -lt 1 ]]; then + result "$num" "$name" FAIL "file has $line_count lines, expected >=3D 1" + rm -f "$testfile" + return + fi + + echo "$content_after" | head -1 | grep -q "line_1_before_dirty_write" || { + result "$num" "$name" FAIL "first line corrupted" + rm -f "$testfile" + return + } + + le=3D"$(read_status_field "last_errno")" + if [[ "$le" !=3D "0" ]]; then + result "$num" "$name" FAIL "last_errno=3D$le, expected 0" + rm -f "$testfile" + return + fi + + rm -f 
"$testfile" + result "$num" "$name" PASS "file intact ($line_count lines)" +} + +# --- Test 3: flock_after_reset ------------------------------------------= ---- +# +# Take an exclusive flock, trigger reset, verify stale lock state is +# marked with CEPH_I_ERROR_FILELOCK (same-client flock attempt returns +# EIO). After the original holder exits (releasing the local lock +# reference and clearing the error flag), a fresh lock can be acquired. +# +# The lock holder uses the fd-based flock form with exec, so killing +# $lock_pid closes the lock fd immediately (no orphaned child with an +# inherited fd copy that would prevent the VFS flock release). + +test_flock_after_reset() +{ + local num=3D3 + local name=3D"flock_after_reset" + local testfile=3D"$MOUNT_POINT/.reset_corner_flock_$$" + local lock_pid probe_rc sc_before sc_after + + sc_before=3D"$(read_status_field "success_count")" + + echo "flock_test_content" > "$testfile" + sync "$testfile" + + # Hold lock via fd in a subshell; exec ensures killing $lock_pid + # closes the lock fd directly (no fork/child fd inheritance). + ( + exec 9<"$testfile" + flock --exclusive --nonblock 9 || exit 1 + exec sleep 300 + ) & + lock_pid=3D$! + sleep 0.5 + + if ! kill -0 "$lock_pid" 2>/dev/null; then + result "$num" "$name" FAIL "flock holder died immediately" + rm -f "$testfile" + return + fi + + echo "flock_after_reset_test" > "$TRIGGER_PATH" 2>/dev/null || { + kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null + result "$num" "$name" FAIL "reset trigger failed" + rm -f "$testfile" + return + } + + if ! 
wait_reset_done 30; then + kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null + result "$num" "$name" FAIL "reset did not complete" + rm -f "$testfile" + return + fi + + sc_after=3D"$(read_status_field "success_count")" + if [[ "$sc_after" -le "$sc_before" ]]; then + kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null + result "$num" "$name" FAIL "success_count did not increment" + rm -f "$testfile" + return + fi + + # After teardown, CEPH_I_ERROR_FILELOCK is set on the inode. + # A same-client lock attempt should fail (EIO), NOT succeed. + probe_rc=3D0 + flock --exclusive --nonblock "$testfile" true 2>/dev/null && probe_rc=3D0= || probe_rc=3D$? + if [[ "$probe_rc" -eq 0 ]]; then + kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null + result "$num" "$name" FAIL \ + "same-client probe succeeded, expected EIO from stale lock state" + rm -f "$testfile" + return + fi + + # Kill the holder -- the exec'd sleep IS $lock_pid, so killing it + # closes fd 9 directly. VFS flock release fires ceph_fl_release_lock(), + # which decrements i_filelock_ref to 0 and clears CEPH_I_ERROR_FILELOCK. + kill "$lock_pid" 2>/dev/null + wait "$lock_pid" 2>/dev/null + + # After the holder exits, a fresh lock should be acquirable. + # The reset teardown sends SESSION_REQUEST_CLOSE so the MDS + # releases locks promptly, but retry briefly in case the + # message races with the connection close. + local attempt + probe_rc=3D1 + for attempt in 1 2 3 4 5; do + probe_rc=3D0 + flock --exclusive --nonblock "$testfile" true 2>/dev/null \ + && probe_rc=3D0 || probe_rc=3D$? 
+ [[ "$probe_rc" -eq 0 ]] && break + sleep 1 + done + if [[ "$probe_rc" -ne 0 ]]; then + result "$num" "$name" FAIL \ + "cannot acquire fresh lock after holder exit (rc=3D$probe_rc, ${attempt= } attempts)" + rm -f "$testfile" + return + fi + + # Verify file content survived + grep -q "flock_test_content" "$testfile" 2>/dev/null || { + result "$num" "$name" FAIL "file content corrupted after reset" + rm -f "$testfile" + return + } + + rm -f "$testfile" + result "$num" "$name" PASS "stale lock detected, fresh lock acquired afte= r holder exit" +} + +# --- Test 4: unmount_during_reset ---------------------------------------= ---- +# +# Mount a fresh CephFS, trigger reset, immediately unmount. The +# ceph_mdsc_destroy() path must wake blocked waiters with -ESHUTDOWN +# and not hang. + +test_unmount_during_reset() +{ + local num=3D4 + local name=3D"unmount_during_reset" + local temp_mnt=3D"/tmp/ceph_corner_mnt_$$" + local mount_opts=3D"" + local mount_src=3D"" + local temp_trigger=3D"" + local temp_status=3D"" + local temp_client=3D"" + local temp_file=3D"$temp_mnt/.reset_corner_umount_$$" + local phase=3D"" + local trigger_ok=3D0 + local attempt + local -a new_clients=3D() + declare -A existing_clients=3D() + + mount_src=3D"$(awk -v mp=3D"$MOUNT_POINT" '$2 =3D=3D mp && $3 =3D=3D "cep= h" {print $1; exit}' /proc/mounts 2>/dev/null)" + mount_opts=3D"$(awk -v mp=3D"$MOUNT_POINT" '$2 =3D=3D mp && $3 =3D=3D "ce= ph" {print $4; exit}' /proc/mounts 2>/dev/null)" + + if [[ -z "$mount_src" ]]; then + result "$num" "$name" SKIP "cannot determine mount source from /proc/mou= nts" + return + fi + + while IFS=3D read -r existing; do + [[ -n "$existing" ]] || continue + existing_clients["$existing"]=3D1 + done < <(list_reset_clients) + + mkdir -p "$temp_mnt" + + if ! 
mount -t ceph "$mount_src" "$temp_mnt" -o "$mount_opts" 2>/dev/null;= then + result "$num" "$name" SKIP "cannot mount additional CephFS instance" + rmdir "$temp_mnt" 2>/dev/null + return + fi + + ls "$temp_mnt" > /dev/null 2>&1 + sync + sleep 1 + + for attempt in $(seq 1 50); do + new_clients=3D() + while IFS=3D read -r entry; do + [[ -n "$entry" ]] || continue + if [[ -n "${existing_clients[$entry]+x}" ]]; then + continue + fi + new_clients+=3D("$entry") + done < <(list_reset_clients) + + if [[ "${#new_clients[@]}" -eq 1 ]]; then + temp_client=3D"${new_clients[0]}" + break + fi + + if [[ "${#new_clients[@]}" -gt 1 ]]; then + break + fi + + sleep 0.1 + done + + if [[ -z "$temp_client" ]]; then + umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null + rmdir "$temp_mnt" 2>/dev/null + result "$num" "$name" SKIP "cannot identify debugfs client for temp moun= t" + return + fi + + if [[ "${#new_clients[@]}" -gt 1 ]]; then + umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null + rmdir "$temp_mnt" 2>/dev/null + result "$num" "$name" SKIP "multiple new debugfs clients appeared" + return + fi + + temp_trigger=3D"$DEBUGFS_ROOT/$temp_client/reset/trigger" + temp_status=3D"$DEBUGFS_ROOT/$temp_client/reset/status" + + echo "umount_dirty_seed" > "$temp_file" 2>/dev/null || { + umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null + rmdir "$temp_mnt" 2>/dev/null + result "$num" "$name" FAIL "cannot create dirty state on temp mount" + return + } + sync "$temp_file" + python3 -c " +import os, sys +fd =3D os.open('$temp_file', os.O_WRONLY | os.O_APPEND) +os.write(fd, b'dirty_for_umount_test\\n') +os.close(fd) +" 2>/dev/null || { + umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null + rmdir "$temp_mnt" 2>/dev/null + result "$num" "$name" FAIL "cannot dirty temp mount for reset overlap" + return + } + + echo "unmount_test" > "$temp_trigger" 2>/dev/null && trigger_ok=3D1 || tr= igger_ok=3D0 + if [[ "$trigger_ok" -ne 1 ]]; then + 
umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null + rmdir "$temp_mnt" 2>/dev/null + result "$num" "$name" FAIL "cannot trigger reset on temp mount" + return + fi + + if ! wait_status_nonidle "$temp_status" 10; then + phase=3D"$(awk -F': ' '$1 =3D=3D "phase" {print $2}' "$temp_status" 2>/d= ev/null)" + umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null + rmdir "$temp_mnt" 2>/dev/null + result "$num" "$name" FAIL \ + "reset never became active before umount (phase=3D${phase:-unknown})" + return + fi + + local umount_ok=3D0 + timeout 30 umount "$temp_mnt" 2>/dev/null && umount_ok=3D1 + + if [[ "$umount_ok" -ne 1 ]]; then + umount -l "$temp_mnt" 2>/dev/null || true + rmdir "$temp_mnt" 2>/dev/null + result "$num" "$name" FAIL "umount hung for >30s" + return + fi + + rmdir "$temp_mnt" 2>/dev/null + + ls "$MOUNT_POINT" > /dev/null 2>&1 || { + result "$num" "$name" FAIL "original mount unhealthy after test" + return + } + + result "$num" "$name" PASS +} + +# --- Main ---------------------------------------------------------------= ----- + +usage() +{ + cat <<EOF +Usage: $0 --mount-point <path> [--client-id <id>] [--debugfs-root <path>] + +Runs targeted corner-case tests for the CephFS client reset feature. +Requires root (debugfs access) and a mounted CephFS filesystem. + +Options: + --mount-point PATH CephFS mount point (required) + --client-id ID Ceph debugfs client id (auto-detect if one client) + --debugfs-root PATH Debugfs ceph root (default: /sys/kernel/debug/cep= h) + --help Show this message +EOF +} + +main() +{ + while [[ $# -gt 0 ]]; do + case "$1" in + --mount-point) MOUNT_POINT=3D"$2"; shift 2 ;; + --client-id) DEBUGFS_CLIENT=3D"$2"; shift 2 ;; + --debugfs-root) DEBUGFS_ROOT=3D"$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 2 ;; + esac + done + + if [[ -z "$MOUNT_POINT" ]]; then + echo "--mount-point is required" >&2 + usage + exit 2 + fi + + if [[ !
-d "$MOUNT_POINT" ]]; then + echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2 + exit "$KSFT_SKIP" + fi + + discover_debugfs + TRIGGER_PATH=3D"$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/trigger" + STATUS_PATH=3D"$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/status" + + log "CephFS client reset corner case tests" + log "Mount: $MOUNT_POINT" + log "Client: $DEBUGFS_CLIENT" + echo "" + + test_ebusy_rejection + test_dirty_caps_at_reset + test_flock_after_reset + test_unmount_during_reset + + echo "" + echo "Results: $PASS_COUNT passed, $FAIL_COUNT failed, $SKIP_COUNT skippe= d (of $TOTAL)" + + if [[ "$FAIL_COUNT" -gt 0 ]]; then + exit 1 + fi + exit 0 +} + +main "$@" --=20 2.34.1
From nobody Sat Jun 13 13:34:47 2026 From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v4 10/11] selftests: ceph: add validation harness Date: Thu, 7 May 2026 12:27:36 +0000 Message-Id: <20260507122737.2804094-11-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com> References: <20260507122737.2804094-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add a one-shot validation wrapper that orchestrates the full reset test suite with per-stage watchdog timeouts and a final status check. The harness runs five stages: baseline (no resets), corner cases, moderate stress, aggressive stress, and a post-run status validation. Each stage runs with an independent timeout so a hang in one stage does not block the entire run. Signed-off-by: Alex Markuze Reviewed-by: Viacheslav Dubeyko Tested-by: Viacheslav Dubeyko --- .../filesystems/ceph/run_validation.sh | 350 ++++++++++++++++++ 1 file changed, 350 insertions(+) create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation= .sh diff --git a/tools/testing/selftests/filesystems/ceph/run_validation.sh b/t= ools/testing/selftests/filesystems/ceph/run_validation.sh new file mode 100755 index 000000000000..5d521e4f9e9b --- /dev/null +++ b/tools/testing/selftests/filesystems/ceph/run_validation.sh @@ -0,0 +1,350 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# CephFS client reset - single-command validation.
+# Runs all test stages in sequence with per-stage timeouts. +# If any stage hangs (filesystem stuck, process blocked), the +# timeout kills it and reports failure. +# +# Usage: +# sudo ./run_validation.sh --mount-point /mnt/mycephfs +# +# Expected output on success: +# +# =3D=3D=3D CephFS Client Reset Validation =3D=3D=3D +# [stage 1/5] baseline PASS (60s, no resets) +# [stage 2/5] corner_cases PASS (4 passed, 0 failed, 0 skipped) +# [stage 3/5] moderate PASS (120s, resets every 5-15s) +# [stage 4/5] aggressive PASS (120s, resets every 1-5s) +# [stage 5/5] status_check PASS (phase=3Didle, last_errno=3D0) +# +# RESULT: 5/5 stages passed +# Artifacts: /tmp/ceph_reset_validation_ + +set -uo pipefail + +KSFT_SKIP=3D4 +SCRIPT_DIR=3D"$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# kselftest auto-detect: when invoked with no arguments (e.g. by +# "make run_tests"), find a CephFS mount automatically or skip. +if [[ $# -eq 0 ]]; then + MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)" + if [[ -z "$MOUNT_POINT" ]]; then + echo "SKIP: No CephFS mount found and --mount-point not specified" + exit "$KSFT_SKIP" + fi + exec "$0" --mount-point "$MOUNT_POINT" +fi + +MOUNT_POINT=3D"" +CLIENT_ID=3D"" +declare -a CLIENT_ARGS=3D() +declare -a DEBUGFS_ARGS=3D() +RUN_ID=3D"$(date +%Y%m%d-%H%M%S)" +OUT_DIR=3D"/tmp/ceph_reset_validation_${RUN_ID}" +DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph" + +# Timeout margins: stage runtime + cooldown + validation + safety buffer +STAGE1_TIMEOUT=3D120 # 60s run + 20s cooldown + 40s buffer +STAGE2_TIMEOUT=3D300 # 4 corner cases, 30s each worst case + buffer +STAGE3_TIMEOUT=3D240 # 120s run + 20s cooldown + 100s buffer +STAGE4_TIMEOUT=3D240 # 120s run + 20s cooldown + 100s buffer +STAGE5_TIMEOUT=3D10 # just reading debugfs + +PASS=3D0 +FAIL=3D0 +TOTAL=3D5 + +usage() +{ + cat <<EOF +Usage: $0 --mount-point <path> [options] + +Required: + --mount-point PATH CephFS mount point + +Options: + --out-dir PATH Artifact directory (default: /tmp/ceph_reset_valid= ation_) + --client-id ID Ceph
debugfs client id (optional) + --debugfs-root PATH Debugfs Ceph root (default: /sys/kernel/debug/ceph) + --help Show this message +EOF +} + +stage_result() +{ + local num=3D"$1" + local name=3D"$2" + local status=3D"$3" + local detail=3D"$4" + + if [[ "$status" =3D=3D "PASS" ]]; then + PASS=3D$((PASS + 1)) + else + FAIL=3D$((FAIL + 1)) + fi + printf '[stage %d/%d] %-16s %s (%s)\n' "$num" "$TOTAL" "$name" "$status"= "$detail" +} + +# Run a command with a timeout. Returns 0 on success, 1 on failure/timeout. +# Sets RUN_TIMED_OUT=3D1 if killed by timeout. +# +# The stage command runs in its own session/process group (via setsid). +# On timeout the entire process group is killed, not just the top-level +# script PID. This is required because stage scripts (reset_stress.sh, +# reset_corner_cases.sh) spawn child processes - I/O workers, rename +# workers, reset injectors, samplers - that would otherwise survive the +# timeout and bleed into later stages, invalidating results. +RUN_TIMED_OUT=3D0 + +run_with_timeout() +{ + local timeout_sec=3D"$1" + local logfile=3D"$2" + shift 2 + + RUN_TIMED_OUT=3D0 + + # Start the stage in its own session via setsid so all descendant + # processes share a process group that we can kill atomically. + # In a non-interactive script, background children are not process + # group leaders, so setsid(1) calls setsid(2) directly (no extra + # fork) and the PID we capture IS the group leader. + setsid "$@" > "$logfile" 2>&1 & + local pid=3D$! + + # Watchdog: on timeout, kill the entire process group + ( + sleep "$timeout_sec" + if kill -0 "$pid" 2>/dev/null; then + echo "TIMEOUT: stage exceeded ${timeout_sec}s, killing process group $p= id" >> "$logfile" + kill -TERM -- -"$pid" 2>/dev/null + sleep 2 + kill -KILL -- -"$pid" 2>/dev/null + fi + ) & + local watchdog_pid=3D$! + + # Wait for the stage command + wait "$pid" 2>/dev/null + local rc=3D$? 
+ + # Kill the watchdog if it's still running + kill "$watchdog_pid" 2>/dev/null + wait "$watchdog_pid" 2>/dev/null + + # Check if it was killed by timeout + if grep -q "^TIMEOUT:" "$logfile" 2>/dev/null; then + RUN_TIMED_OUT=3D1 + return 1 + fi + + return "$rc" +} + +find_status_path() +{ + local entry + + if [[ -n "$CLIENT_ID" ]]; then + if [[ -r "$DEBUGFS_ROOT/$CLIENT_ID/reset/status" ]]; then + echo "$DEBUGFS_ROOT/$CLIENT_ID/reset/status" + return 0 + fi + return 1 + fi + + for entry in "$DEBUGFS_ROOT"/*/; do + if [[ -r "${entry}reset/status" ]]; then + echo "${entry}reset/status" + return 0 + fi + done + return 1 +} + +read_status_field() +{ + local status_path=3D"$1" + local field=3D"$2" + awk -F': ' -v key=3D"$field" '$1 =3D=3D key {print $2}' "$status_path" 2>= /dev/null +} + +# --- Parse arguments ----------------------------------------------------= --- + +while [[ $# -gt 0 ]]; do + case "$1" in + --mount-point) MOUNT_POINT=3D"$2"; shift 2 ;; + --out-dir) OUT_DIR=3D"$2"; shift 2 ;; + --client-id) CLIENT_ID=3D"$2"; shift 2 ;; + --debugfs-root) DEBUGFS_ROOT=3D"$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) echo "Unknown option: $1" >&2; usage; exit 2 ;; + esac +done + +if [[ -z "$MOUNT_POINT" ]]; then + echo "SKIP: --mount-point is required" >&2 + usage + exit "$KSFT_SKIP" +fi + +if [[ ! -d "$MOUNT_POINT" ]]; then + echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2 + exit "$KSFT_SKIP" +fi + +# Auto-detect client id when not specified, so all stages (including +# stage 5 status check) use the same client consistently. +if [[ -z "$CLIENT_ID" ]]; then + candidates=3D() + for entry in "$DEBUGFS_ROOT"/*/; do + name=3D"$(basename "$entry")" + if [[ -r "${entry}reset/status" ]]; then + candidates+=3D("$name") + fi + done + if [[ ${#candidates[@]} -eq 1 ]]; then + CLIENT_ID=3D"${candidates[0]}" + elif [[ ${#candidates[@]} -gt 1 ]]; then + echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client= -id." 
>&2 + exit "$KSFT_SKIP" + fi +fi + +if [[ -n "$CLIENT_ID" ]]; then + CLIENT_ARGS=3D(--client-id "$CLIENT_ID") +fi +DEBUGFS_ARGS=3D(--debugfs-root "$DEBUGFS_ROOT") + +# Quick sanity: can we write to the mount? +if ! touch "$MOUNT_POINT/.validation_probe_$$" 2>/dev/null; then + echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2 + exit "$KSFT_SKIP" +fi +rm -f "$MOUNT_POINT/.validation_probe_$$" + +mkdir -p "$OUT_DIR" + +echo "" +echo "=3D=3D=3D CephFS Client Reset Validation =3D=3D=3D" +echo "" + +# --- Stage 1: Baseline (no resets) --------------------------------------= --- + +stage1_out=3D"$OUT_DIR/stage1_baseline" +if run_with_timeout "$STAGE1_TIMEOUT" "$stage1_out.log" \ + "$SCRIPT_DIR/reset_stress.sh" \ + --mount-point "$MOUNT_POINT" \ + --profile baseline \ + --no-reset \ + --duration-sec 60 \ + "${CLIENT_ARGS[@]}" \ + "${DEBUGFS_ARGS[@]}" \ + --out-dir "$stage1_out"; then + stage_result 1 "baseline" "PASS" "60s, no resets" +elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then + stage_result 1 "baseline" "FAIL" "HUNG: killed after ${STAGE1_TIMEOUT}s" +else + stage_result 1 "baseline" "FAIL" "see $stage1_out.log" +fi + +# --- Stage 2: Corner cases ----------------------------------------------= --- + +stage2_out=3D"$OUT_DIR/stage2_corner_cases" +mkdir -p "$stage2_out" +if run_with_timeout "$STAGE2_TIMEOUT" "$stage2_out/output.log" \ + "$SCRIPT_DIR/reset_corner_cases.sh" \ + "${CLIENT_ARGS[@]}" \ + "${DEBUGFS_ARGS[@]}" \ + --mount-point "$MOUNT_POINT"; then + pass_line=3D$(grep -Eo '[0-9]+ passed, [0-9]+ failed, [0-9]+ skipped' "$s= tage2_out/output.log" | tail -1) + stage_result 2 "corner_cases" "PASS" "${pass_line:-all tests passed}" +elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then + stage_result 2 "corner_cases" "FAIL" "HUNG: killed after ${STAGE2_TIMEOUT= }s" +else + fail_line=3D$(grep -c 'FAIL' "$stage2_out/output.log" 2>/dev/null) || fai= l_line=3D${fail_line:-?} + stage_result 2 "corner_cases" "FAIL" "${fail_line} failures, see $stage2_= out/output.log" +fi + +# --- Stage 3: Moderate
resets -------------------------------------------= ---- + +stage3_out=3D"$OUT_DIR/stage3_moderate" +if run_with_timeout "$STAGE3_TIMEOUT" "$stage3_out.log" \ + "$SCRIPT_DIR/reset_stress.sh" \ + --mount-point "$MOUNT_POINT" \ + --profile moderate \ + --duration-sec 120 \ + "${CLIENT_ARGS[@]}" \ + "${DEBUGFS_ARGS[@]}" \ + --out-dir "$stage3_out"; then + stage_result 3 "moderate" "PASS" "120s, resets every 5-15s" +elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then + stage_result 3 "moderate" "FAIL" "HUNG: killed after ${STAGE3_TIMEOUT}s" +else + stage_result 3 "moderate" "FAIL" "see $stage3_out.log" +fi + +# --- Stage 4: Aggressive resets -----------------------------------------= ---- + +stage4_out=3D"$OUT_DIR/stage4_aggressive" +if run_with_timeout "$STAGE4_TIMEOUT" "$stage4_out.log" \ + "$SCRIPT_DIR/reset_stress.sh" \ + --mount-point "$MOUNT_POINT" \ + --profile aggressive \ + --duration-sec 120 \ + "${CLIENT_ARGS[@]}" \ + "${DEBUGFS_ARGS[@]}" \ + --out-dir "$stage4_out"; then + stage_result 4 "aggressive" "PASS" "120s, resets every 1-5s" +elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then + stage_result 4 "aggressive" "FAIL" "HUNG: killed after ${STAGE4_TIMEOUT}s" +else + stage_result 4 "aggressive" "FAIL" "see $stage4_out.log" +fi + +# --- Stage 5: Post-run status check -------------------------------------= --- + +status_path=3D"" +if status_path=3D$(find_status_path); then + phase=3D$(read_status_field "$status_path" "phase") + last_errno=3D$(read_status_field "$status_path" "last_errno") + failure_count=3D$(read_status_field "$status_path" "failure_count") + drain_timed_out=3D$(read_status_field "$status_path" "drain_timed_out") + sessions_reset=3D$(read_status_field "$status_path" "sessions_reset") + blocked=3D$(read_status_field "$status_path" "blocked_requests") + + # Save full status + cat "$status_path" > "$OUT_DIR/final_status.txt" 2>/dev/null + + errors=3D"" + [[ "$phase" !=3D "idle" ]] && errors=3D"${errors}phase=3D$phase " + [[ "$last_errno" !=3D "0" ]] && 
errors="${errors}last_errno=$last_errno "
+	[[ "$failure_count" != "0" && -n "$failure_count" ]] && errors="${errors}failure_count=$failure_count "
+	[[ "$blocked" != "0" ]] && errors="${errors}blocked_requests=$blocked "
+
+	if [[ -z "$errors" ]]; then
+		detail="phase=$phase, last_errno=$last_errno, failure_count=${failure_count:-0}"
+		[[ "$drain_timed_out" == "yes" ]] && detail="$detail, drain_timed_out=yes"
+		[[ -n "$sessions_reset" ]] && detail="$detail, sessions_reset=$sessions_reset"
+		stage_result 5 "status_check" "PASS" "$detail"
+	else
+		stage_result 5 "status_check" "FAIL" "$errors"
+	fi
+else
+	stage_result 5 "status_check" "FAIL" "cannot read reset/status"
+fi
+
+# --- Summary ----------------------------------------------------------------
+
+echo ""
+if [[ "$FAIL" -eq 0 ]]; then
+	echo "RESULT: $PASS/$TOTAL stages passed"
+else
+	echo "RESULT: $PASS/$TOTAL stages passed, $FAIL FAILED"
+fi
+echo "Artifacts: $OUT_DIR"
+echo ""
+
+exit "$FAIL"
-- 
2.34.1

From nobody Sat Jun 13 13:34:47 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v4 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation
Date: Thu, 7 May 2026 12:27:37 +0000
Message-Id: <20260507122737.2804094-12-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Wire the CephFS reset test suite into the kselftest build:

- Add filesystems/ceph to the top-level selftests Makefile.
- Add the per-suite Makefile with run_validation.sh as TEST_PROGS.
- Add the settings file (kselftest timeout).
- Add the MAINTAINERS entry for the test directory.
- Add README with prerequisites, usage, and troubleshooting.
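
A minimal sketch of how the newly wired target could be selected through the generic kselftest entry points, assuming it is run against a kernel source tree. The function name `run_ceph_ksft` and the `DRY_RUN` switch are hypothetical and not part of this patch; only `TARGETS=` and `run_tests` come from the standard selftests Makefile.

```shell
#!/bin/sh
# Hypothetical helper (not part of this patch): build and run only the
# CephFS reset suite through the top-level selftests Makefile.
# $1 is the kernel source tree (assumed; defaults to the current directory).
# DRY_RUN=1 only prints the command, so the sketch can be tried anywhere.
run_ceph_ksft() (
	ksrc="${1:-.}"
	set -- make -C "$ksrc/tools/testing/selftests" \
		TARGETS=filesystems/ceph run_tests
	if [ "${DRY_RUN:-0}" = "1" ]; then
		echo "$*"
	else
		"$@"
	fi
)
```

With `DRY_RUN=1 run_ceph_ksft /path/to/linux` the command is only printed. Without it the suite is built and run, which still needs root and a CephFS mount as listed in the README prerequisites.
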
Signed-off-by: Alex Markuze
Reviewed-by: Viacheslav Dubeyko
Tested-by: Viacheslav Dubeyko
---
 MAINTAINERS                                   |  1 +
 fs/ceph/mds_client.c                          |  3 +-
 fs/ceph/mds_client.h                          |  1 +
 tools/testing/selftests/Makefile              |  1 +
 .../selftests/filesystems/ceph/Makefile       |  7 ++
 .../testing/selftests/filesystems/ceph/README | 84 +++++++++++++++++++
 .../selftests/filesystems/ceph/settings       |  1 +
 7 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
 create mode 100644 tools/testing/selftests/filesystems/ceph/README
 create mode 100644 tools/testing/selftests/filesystems/ceph/settings

diff --git a/MAINTAINERS b/MAINTAINERS
index 2fb1c75afd16..bf6d973ac3fb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5905,6 +5905,7 @@ B:	https://tracker.ceph.com/
 T:	git https://github.com/ceph/ceph-client.git
 F:	Documentation/filesystems/ceph.rst
 F:	fs/ceph/
+F:	tools/testing/selftests/filesystems/ceph/
 
 CERTIFICATE HANDLING
 M:	David Howells
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b16638ebff7f..3b6560da8c4e 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2359,6 +2359,7 @@ struct flush_dump_entry {
 static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
 {
 	struct ceph_client *cl = mdsc->fsc->client;
+	int i;
 	struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES];
 	struct ceph_cap_flush *cf;
 	int n = 0, remaining = 0;
@@ -2388,7 +2389,7 @@ static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
 
 	pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
 		       want_tid);
-	for (int i = 0; i < n; i++) {
+	for (i = 0; i < n; i++) {
 		struct flush_dump_entry *e = &entries[i];
 
 		if (e->ci_null)
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index b1a0621cd37e..731d6ad04956 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -121,6 +121,7 @@ static inline bool ceph_reset_is_idle(struct ceph_client_reset_state *st)
 {
 	return READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE;
 }
+
 struct ceph_mds_cap_match {
 	s64 uid;  /* default to MDS_AUTH_UID_ANY */
 	u32 num_gids;
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e41..ab254ae793a9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -32,6 +32,7 @@ TARGETS += exec
 TARGETS += fchmodat2
 TARGETS += filesystems
 TARGETS += filesystems/binderfs
+TARGETS += filesystems/ceph
 TARGETS += filesystems/epoll
 TARGETS += filesystems/fat
 TARGETS += filesystems/overlayfs
diff --git a/tools/testing/selftests/filesystems/ceph/Makefile b/tools/testing/selftests/filesystems/ceph/Makefile
new file mode 100644
index 000000000000..4ad3e8d40d90
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_PROGS := run_validation.sh
+TEST_FILES := reset_stress.sh reset_corner_cases.sh \
+	validate_consistency.py README settings
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/filesystems/ceph/README b/tools/testing/selftests/filesystems/ceph/README
new file mode 100644
index 000000000000..eb0092b38f80
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/README
@@ -0,0 +1,84 @@
+# CephFS Client Reset Test Suite
+
+Test suite for the CephFS kernel client manual session reset feature.
+This trimmed set contains the single-client stress test, the targeted
+corner-case test, and the one-shot validation harness used during
+feature bring-up.
+
+## Prerequisites
+
+- Linux kernel with the CephFS client reset feature (this branch)
+- A running Ceph cluster with at least one MDS
+- Root access (debugfs requires it)
+- Python 3 (for validators)
+- flock utility (for lock tests, usually in util-linux)
+
+## Test inventory
+
+| Test | Script(s) | What it covers |
+|------|-----------|----------------|
+| Single-client stress | `reset_stress.sh` | I/O + resets + data integrity on one mount |
+| Corner cases | `reset_corner_cases.sh` | EBUSY, dirty caps, flock reclaim, unmount-during-reset |
+| Validation harness | `run_validation.sh` | baseline + corner cases + moderate/aggressive stress + final status check |
+
+## Quick start
+
+Stress run:
+
+    sudo ./reset_stress.sh --mount-point /mnt/cephfs --profile moderate
+
+Corner cases:
+
+    sudo ./reset_corner_cases.sh --mount-point /mnt/cephfs
+
+End-to-end validation:
+
+    sudo ./run_validation.sh --mount-point /mnt/cephfs
+
+## Stress profiles
+
+    baseline   - no resets, 1 IO + 1 rename, 600s
+    moderate   - reset every 5-15s, 2 IO + 1 rename, 900s
+    aggressive - reset every 1-5s, 4 IO + 2 rename, 900s
+    soak       - reset every 5-15s, 2 IO + 1 rename, 3600s
+
+## Key options (all scripts)
+
+    --mount-point PATH    CephFS mount point (required)
+    --client-id ID        Debugfs client id (auto-detected if one)
+
+reset_stress.sh additionally accepts:
+
+    --profile NAME        baseline|moderate|aggressive|soak
+    --duration-sec N      Override profile runtime
+    --no-reset            Disable reset injection
+    --out-dir PATH        Artifact directory
+
+## Corner case tests
+
+    [1/4] ebusy_rejection       Second reset rejected while first in-flight
+    [2/4] dirty_caps_at_reset   Reset with unflushed dirty caps
+    [3/4] flock_after_reset     Stale lock EIO + fresh lock after holder exit
+    [4/4] unmount_during_reset  umount during active reset (destroy-path wakeup)
+
+Test 4 requires creating a second CephFS mount instance and SKIPs if
+the host cannot do so. See `--help` output for details.
+
+## Troubleshooting
+
+**No writable Ceph reset interface found:**
+Kernel lacks the reset feature, debugfs not mounted, or not root.
+Check: `ls /sys/kernel/debug/ceph/*/reset/`
+
+**Multiple Ceph clients found:**
+Use `--client-id` to select one.
+List: `ls /sys/kernel/debug/ceph/`
+
+## Files
+
+| File | Role |
+|------|------|
+| `reset_stress.sh` | Single-client stress test runner |
+| `validate_consistency.py` | Single-client post-run validator |
+| `reset_corner_cases.sh` | Corner case harness (4 sequential tests) |
+| `run_validation.sh` | One-shot validation harness |
diff --git a/tools/testing/selftests/filesystems/ceph/settings b/tools/testing/selftests/filesystems/ceph/settings
new file mode 100644
index 000000000000..79b65bdf05db
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/settings
@@ -0,0 +1 @@
+timeout=1200
-- 
2.34.1
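
The README's troubleshooting hints (one client auto-detected, `--client-id` needed when several exist) can be illustrated with a small sketch. This is not the detection code from the patch's scripts. It assumes that each client has a directory under the debugfs root and that a reset-capable kernel exposes `<client>/reset/status` there. The function names `list_reset_clients` and `pick_reset_client` are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch, not taken from the patch's scripts.
# Assumed layout: <debugfs root>/<client id>/reset/status

# Print one client id per line for every client with a readable reset/status.
list_reset_clients() (
	root="${1:-/sys/kernel/debug/ceph}"
	for dir in "$root"/*/; do
		[ -r "${dir}reset/status" ] && basename "$dir"
	done
	return 0
)

# Print the single usable client id.
# Returns 1 if no client has a reset interface, 2 if more than one does.
pick_reset_client() (
	root="${1:-/sys/kernel/debug/ceph}"
	ids="$(list_reset_clients "$root")"
	# Count non-empty lines; grep exits non-zero when the count is 0.
	count="$(printf '%s' "$ids" | grep -c . || true)"
	case "$count" in
	0) echo "no Ceph reset interface under $root" >&2; return 1 ;;
	1) echo "$ids" ;;
	*) echo "multiple Ceph clients under $root, pass --client-id" >&2
	   return 2 ;;
	esac
)
```

Taking the debugfs root as a parameter lets the sketch be exercised against a scratch directory instead of a live mount. On a real system it would be called with no argument, as root.
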