From nobody Sat Jun 20 11:50:30 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6A1073E5571
	for <linux-kernel@vger.kernel.org>; Wed, 15 Apr 2026 17:01:18 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1776272480; cv=none;
 b=RCfYkzyDhnIKAcSUiJDA0c/jN34Fvjv1Xvr1zMok6r1Rnb88fgFpWVBAQMp0wXXKZPyIUOQSiRAq/ppaPC/lxThLoBWY43HS5jZ6U1K4wJEGnyR3nFN2Wkxc8G0nu8wMjc+jp6heuhsXM/M/eLO6OXSXULpor+EqK5V8EEnS+Jk=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1776272480; c=relaxed/simple;
	bh=iPptitDCHY6iUmzXK5v2EJe1oBKCUy+jF+gFF6ntNCM=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=Fc2kioJA0CsdPREYjSnLIEbTINKyx3pZ30RFzRAdF5iPe8MtDTv+tmJqDu1K1qtKpMVcDJMxkYDcAq31x+Zll49CYEdZTYlD73thxTpJNtzXHxDB3ATlVjthmlU0Ee13Obe5lmfpysbgiyiFM8P+nLudjyRWlSTM+Hp2xEAhCFE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=foB3wl1E;
 dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=sYE6ge+0; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="foB3wl1E";
	dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="sYE6ge+0"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1776272477;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=jyX88k/jwVWT5oP8d21qPdRgBVXaGwKPazLNDD9Kmn0=;
	b=foB3wl1E0jWDc4YomNnkdZQIiGJwZjhitaxaP5L/KPDPQiAZiBTCh1Zi4tHSKtUQhINy6y
	adX0FKokUDJrrOjdCaT3Qi5uDr3Mlrr1nUdlM2vlxEM8ixiur0AAfkdEP814v/qome1crB
	YWs5dxD/OVH8/btbi1aHCreZMxq0kIo=
Received: from mail-oa1-f71.google.com (mail-oa1-f71.google.com
 [209.85.160.71]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-48-r__5RY4hOqiWU5XiyCaebA-1; Wed, 15 Apr 2026 13:01:05 -0400
X-MC-Unique: r__5RY4hOqiWU5XiyCaebA-1
X-Mimecast-MFC-AGG-ID: r__5RY4hOqiWU5XiyCaebA_1776272465
Received: by mail-oa1-f71.google.com with SMTP id
 586e51a60fabf-42467c9547bso2301597fac.0
        for <linux-kernel@vger.kernel.org>;
 Wed, 15 Apr 2026 10:01:05 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=redhat.com; s=google; t=1776272465; x=1776877265;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=jyX88k/jwVWT5oP8d21qPdRgBVXaGwKPazLNDD9Kmn0=;
        b=sYE6ge+0bu/RQBx8/YP20kDuuY1JacHNCoRcfret5o3Io9fDYgE2YPXTerc1fb4Xty
         8MmzhP1KOKRplVMCSyRy/asUOUYqespq8AhduI+1CvXYdxL0AqUqthv57TxjeEMbRZ1S
         PounRg6nLEugJzDlm4SfjS7UTjQFCEFvGSEw4qT2pvYSBqKPVSmib2kfenZvo+9xDi23
         U9a6GkAy2RgdGYfYEqSAmaQTfitUh8Oog9RuwsDdGkMGEE/hTVVMCYn+b8fvkDDMt1SG
         PEaVY2N0utgFdLmLqSL6nQXPVOexiPtvcHu/W3v2lHcsQbyh9FX0FM+wWzSt1RjBhWKa
         QFSQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1776272465; x=1776877265;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
         :to:cc:subject:date:message-id:reply-to;
        bh=jyX88k/jwVWT5oP8d21qPdRgBVXaGwKPazLNDD9Kmn0=;
        b=B21XDY9/eg1wyzL3iKbA0jhPH8wdmoYq2KbdLSLmXg29mKIF/+jyWSOu33EEluqQ1R
         FkJwIDtmdk0SGztx0TXZtXPmBO4Y9Sbq2iaDR6PkC6m0MDo/NyWZUxz9C3TEnFMwZBSx
         kd4neh1WGVoXnJd49axXgGTxhHx9ftuCqs+sofKmGkKBDBTBmpyhqpKX06KOd9ueN+yX
         gQgVwj0n3bh/WwGUZyXyFXILe8WthTnc0Pwk0lfQSHWGG9jQLKBDhk0EzneGoSgCgOsK
         +leJDNAR/7gV07Amg6G5ICbtxlv3DuHn0WEx01j2xFkIGePJbVsJabeF7tMX+8hUCb9/
         m8HA==
X-Gm-Message-State: AOJu0YzMhAIdJ9D5TAapGACp0x+JjiUSHRMOb7ILfh1frPtk2KNMI9bW
	0JXkf4aWKM9WImirvUuYxvKXYSm89LZAkybPYQjyL/O8Hs50n+nKhw6/BgBW2hf/gjXDZIdzdZx
	TVl42dJol2FKWYBeizxt1Z/meG4L+e4Y/zrgLF0CDdfqd6CdbvMyVoyYszGFLqCt1NnRspzYi1b
	zt
X-Gm-Gg: AeBDieuAnpUf/5awxEhYeNA20I/JAn5ZfXF3IDHJ7EXkhEZkvhu3D6VFaO7Cv4bnsAr
	NR7+WLMmTGGnKJBo+kod52a7qgVouH6Miz1t4EQW9LznoSA1HsOvkZMOAPaSm8OfaNTtimlBZN6
	o2t2Woylb5BvREglsmmNqbm9XC37qEutCG5PCO3NeSYn6kTIIcgTW+Ao59ctqK0ABxzdp5LIj+R
	pwxirYIF3ziM8cNil58KtHlAMHeV3gPhrJRYeflOP4HaFVw0oS65VedLPEBaDte6ny2yFdhqF3/
	j2wK/380lA+/pXWcPmcKjjE7ZMbjMMqfqk0Qvxdbf5KqCFPUuV7sstdY2e2Zk4N3KYzSlkPfCtr
	Txey9OPGFFg0E5V1JnOqbLvuX+JpnZCkjXrmBhiqPJuYmfpR7HsB7cYzBcvhN32aoDA==
X-Received: by 2002:a05:6820:330d:b0:67f:c06c:a5e6 with SMTP id
 006d021491bc7-68be7ee6967mr7305587eaf.37.1776272462752;
        Wed, 15 Apr 2026 10:01:02 -0700 (PDT)
X-Received: by 2002:a05:6820:330d:b0:67f:c06c:a5e6 with SMTP id
 006d021491bc7-68be7ee6967mr7305554eaf.37.1776272462033;
        Wed, 15 Apr 2026 10:01:02 -0700 (PDT)
Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com.
 [13.121.85.79])
        by smtp.gmail.com with ESMTPSA id
 d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.00
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 15 Apr 2026 10:01:01 -0700 (PDT)
From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	idryomov@gmail.com,
	vdubeyko@redhat.com,
	Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v2 1/7] ceph: convert inode flags to named bit positions and
 atomic bitops
Date: Wed, 15 Apr 2026 17:00:37 +0000
Message-Id: <20260415170043.3882912-2-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Define named bit-position constants for all CEPH_I_* inode flags and
derive the bitmask values from them.  This gives every flag a named
_BIT constant usable with the test_bit/set_bit/clear_bit family.
The intentionally unused bit position 1 is documented inline.

Convert all flag modifications to use atomic bitops (set_bit,
clear_bit, test_and_clear_bit).  The previous code mixed lockless
atomic ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic
read-modify-write (|=3D / &=3D ~) on other flags sharing the same
unsigned long.  A concurrent non-atomic RMW can clobber an
adjacent lockless atomic update -- for example, a lockless
clear_bit(ERROR_WRITE) could be silently resurrected by a
concurrent ci->i_ceph_flags |=3D CEPH_I_FLUSH under the spinlock.
Using atomic bitops for all modifications eliminates this class
of race entirely.

Flags whose only users are now the _BIT form (ERROR_WRITE,
ERROR_FILELOCK, SHUTDOWN, ASYNC_CHECK_CAPS) have their old mask
defines removed to document that callers must use the _BIT
constant with the set_bit/test_bit family.

Flag reads under i_ceph_lock continue to use bitmask tests where
the tested flag is only modified under the same lock; this is safe
because the lock serialises both the read and the write.  Only flag
modifications are converted; these locked bitmask reads are correct
and unchanged.

The lockless reader ceph_inode_is_shutdown() retains the READ_ONCE()
snapshot plus bitmask test pattern -- the single atomic load into a
local variable is correct and avoids a second memory access that
test_bit() would require.

The direct assignment in ceph_finish_async_create() is converted
from i_ceph_flags =3D CEPH_I_ASYNC_CREATE to set_bit().  This
inode is I_NEW at this point -- still invisible to other threads
and guaranteed to have zero flags from alloc_inode -- so either
form is safe, but set_bit() keeps the conversion uniform.

The only remaining direct assignment (alloc_inode zeroing) operates
on an inode that is not yet visible to other threads, so it is safe
without atomic ops.

The precomputed local flags value in ceph_pool_perm_check() is
removed; flags is now re-read from i_ceph_flags after the set_bit()
calls, before jumping back to the check: label, keeping a single
source of truth.

Co-developed-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
Signed-off-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
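Note for reviewers (below the cut line, not part of the changelog):
the lost-update interleaving described above can be replayed
deterministically in user space.  This is only a sketch under stated
assumptions: C11 <stdatomic.h> operations stand in for the kernel's
set_bit()/clear_bit(), the interleaving is replayed on one thread, and
every demo_*/DEMO_* name is hypothetical rather than a kernel symbol.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit positions mirroring the layout in the patch. */
#define DEMO_FLUSH_BIT        2
#define DEMO_ERROR_WRITE_BIT  9
#define DEMO_MASK(bit)        (1UL << (bit))

/* User-space stand-in for set_bit() (assumption, not the kernel API). */
static void demo_set_bit(int bit, atomic_ulong *flags)
{
	atomic_fetch_or(flags, DEMO_MASK(bit));
}

/* User-space stand-in for clear_bit() (assumption). */
static void demo_clear_bit(int bit, atomic_ulong *flags)
{
	atomic_fetch_and(flags, ~DEMO_MASK(bit));
}

/*
 * Old pattern: writer A does a plain load/modify/store to set FLUSH,
 * and writer B clears ERROR_WRITE atomically between A's load and
 * A's store.  A's stale store brings ERROR_WRITE back.
 */
unsigned long demo_replay_nonatomic_rmw(void)
{
	atomic_ulong flags;
	unsigned long snap;

	atomic_init(&flags, DEMO_MASK(DEMO_ERROR_WRITE_BIT));
	/* A: load the whole word */
	snap = atomic_load(&flags);
	/* B: lockless atomic clear */
	demo_clear_bit(DEMO_ERROR_WRITE_BIT, &flags);
	/* A: store the stale word with FLUSH added */
	atomic_store(&flags, snap | DEMO_MASK(DEMO_FLUSH_BIT));
	return atomic_load(&flags);
}

/* New pattern: both writers use atomic bit operations. */
unsigned long demo_replay_atomic(void)
{
	atomic_ulong flags;

	atomic_init(&flags, DEMO_MASK(DEMO_ERROR_WRITE_BIT));
	demo_clear_bit(DEMO_ERROR_WRITE_BIT, &flags);   /* B */
	demo_set_bit(DEMO_FLUSH_BIT, &flags);           /* A */
	return atomic_load(&flags);
}

/* Aborts if either replay does not behave as described. */
void demo_check(void)
{
	unsigned long ew = DEMO_MASK(DEMO_ERROR_WRITE_BIT);

	assert(demo_replay_nonatomic_rmw() & ew);
	assert(!(demo_replay_atomic() & ew));
}
```

In the first replay ERROR_WRITE survives because the stale store
overwrites the concurrent clear; in the second only FLUSH remains set.
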
 fs/ceph/addr.c       | 16 +++++------
 fs/ceph/caps.c       | 24 ++++++++---------
 fs/ceph/file.c       | 12 ++++-----
 fs/ceph/inode.c      |  4 +--
 fs/ceph/locks.c      | 22 ++++-----------
 fs/ceph/mds_client.c |  3 ++-
 fs/ceph/mds_client.h |  2 +-
 fs/ceph/snap.c       |  2 +-
 fs/ceph/super.h      | 64 ++++++++++++++++++++++----------------------
 fs/ceph/xattr.c      |  2 +-
 10 files changed, 69 insertions(+), 82 deletions(-)
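
The wake_async_create_waiters() hunk below folds a separate test and
clear of ASYNC_CHECK_CAPS into one test_and_clear_bit() call.  A
minimal user-space sketch of that semantic follows, assuming C11
atomic_fetch_and() as a stand-in for the kernel primitive; the demo_*
and DEMO_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical constant mirroring CEPH_I_ASYNC_CHECK_CAPS_BIT. */
#define DEMO_ASYNC_CHECK_CAPS_BIT 14

/*
 * Stand-in for test_and_clear_bit(): clear the bit and report whether
 * it was set beforehand, as a single atomic step.
 */
static bool demo_test_and_clear_bit(int bit, atomic_ulong *flags)
{
	unsigned long mask = 1UL << bit;
	unsigned long old = atomic_fetch_and(flags, ~mask);

	return (old & mask) != 0;
}

/* Returns how many of two back-to-back calls observed the bit set. */
int demo_count_observers(void)
{
	atomic_ulong flags;
	int seen = 0;

	atomic_init(&flags, 1UL << DEMO_ASYNC_CHECK_CAPS_BIT);
	if (demo_test_and_clear_bit(DEMO_ASYNC_CHECK_CAPS_BIT, &flags))
		seen++;
	if (demo_test_and_clear_bit(DEMO_ASYNC_CHECK_CAPS_BIT, &flags))
		seen++;
	return seen;
}

/* Aborts unless exactly one caller saw the bit. */
void demo_check_observers(void)
{
	assert(demo_count_observers() == 1);
}
```

Only the first caller sees the bit set, which is the property the
check_cap assignment in that hunk relies on.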

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2090fc78529c..bde9efffa228 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -2583,20 +2583,18 @@ int ceph_pool_perm_check(struct inode *inode, int n=
eed)
 	if (ret < 0)
 		return ret;
=20
-	flags =3D CEPH_I_POOL_PERM;
-	if (ret & POOL_READ)
-		flags |=3D CEPH_I_POOL_RD;
-	if (ret & POOL_WRITE)
-		flags |=3D CEPH_I_POOL_WR;
-
 	spin_lock(&ci->i_ceph_lock);
 	if (pool =3D=3D ci->i_layout.pool_id &&
 	    pool_ns =3D=3D rcu_dereference_raw(ci->i_layout.pool_ns)) {
-		ci->i_ceph_flags |=3D flags;
-        } else {
+		set_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_READ)
+			set_bit(CEPH_I_POOL_RD_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_WRITE)
+			set_bit(CEPH_I_POOL_WR_BIT, &ci->i_ceph_flags);
+	} else {
 		pool =3D ci->i_layout.pool_id;
-		flags =3D ci->i_ceph_flags;
 	}
+	flags =3D ci->i_ceph_flags;
 	spin_unlock(&ci->i_ceph_lock);
 	goto check;
 }
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..cb9e78b713d9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -549,7 +549,7 @@ static void __cap_delay_requeue_front(struct ceph_mds_c=
lient *mdsc,
=20
 	doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
 	spin_lock(&mdsc->cap_delay_lock);
-	ci->i_ceph_flags |=3D CEPH_I_FLUSH;
+	set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 	if (!list_empty(&ci->i_cap_delay_list))
 		list_del_init(&ci->i_cap_delay_list);
 	list_add(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
@@ -1409,7 +1409,7 @@ static void __prep_cap(struct cap_msg_args *arg, stru=
ct ceph_cap *cap,
 	      ceph_cap_string(revoking));
 	BUG_ON((retain & CEPH_CAP_PIN) =3D=3D 0);
=20
-	ci->i_ceph_flags &=3D ~CEPH_I_FLUSH;
+	clear_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
=20
 	cap->issued &=3D retain;  /* drop bits we don't want */
 	/*
@@ -1666,7 +1666,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info=
 *ci,
 		last_tid =3D capsnap->cap_flush.tid;
 	}
=20
-	ci->i_ceph_flags &=3D ~CEPH_I_FLUSH_SNAPS;
+	clear_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
=20
 	while (first_tid <=3D last_tid) {
 		struct ceph_cap *cap =3D ci->i_auth_cap;
@@ -2026,7 +2026,7 @@ void ceph_check_caps(struct ceph_inode_info *ci, int =
flags)
=20
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		ci->i_ceph_flags |=3D CEPH_I_ASYNC_CHECK_CAPS;
+		set_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT, &ci->i_ceph_flags);
=20
 		/* Don't send messages until we get async create reply */
 		spin_unlock(&ci->i_ceph_lock);
@@ -2577,7 +2577,7 @@ static void __kick_flushing_caps(struct ceph_mds_clie=
nt *mdsc,
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
 		return;
=20
-	ci->i_ceph_flags &=3D ~CEPH_I_KICK_FLUSH;
+	clear_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
=20
 	list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) {
 		if (cf->is_capsnap) {
@@ -2686,7 +2686,7 @@ void ceph_early_kick_flushing_caps(struct ceph_mds_cl=
ient *mdsc,
 			__kick_flushing_caps(mdsc, session, ci,
 					     oldest_flush_tid);
 		} else {
-			ci->i_ceph_flags |=3D CEPH_I_KICK_FLUSH;
+			set_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
 		}
=20
 		spin_unlock(&ci->i_ceph_lock);
@@ -2829,7 +2829,7 @@ static int try_get_cap_refs(struct inode *inode, int =
need, int want,
 	spin_lock(&ci->i_ceph_lock);
=20
 	if ((flags & CHECK_FILELOCK) &&
-	    (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK)) {
+	    test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		doutc(cl, "%p %llx.%llx error filelock\n", inode,
 		      ceph_vinop(inode));
 		ret =3D -EIO;
@@ -3207,7 +3207,7 @@ static int ceph_try_drop_cap_snap(struct ceph_inode_i=
nfo *ci,
 		BUG_ON(capsnap->cap_flush.tid > 0);
 		ceph_put_snap_context(capsnap->context);
 		if (!list_is_last(&capsnap->ci_item, &ci->i_cap_snaps))
-			ci->i_ceph_flags |=3D CEPH_I_FLUSH_SNAPS;
+			set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
=20
 		list_del(&capsnap->ci_item);
 		ceph_put_cap_snap(capsnap);
@@ -3396,7 +3396,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_inf=
o *ci, int nr,
 				if (ceph_try_drop_cap_snap(ci, capsnap)) {
 					put++;
 				} else {
-					ci->i_ceph_flags |=3D CEPH_I_FLUSH_SNAPS;
+					set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 					flush_snaps =3D true;
 				}
 			}
@@ -3648,7 +3648,7 @@ static void handle_cap_grant(struct inode *inode,
=20
 		if (ci->i_layout.pool_id !=3D old_pool ||
 		    extra_info->pool_ns !=3D old_ns)
-			ci->i_ceph_flags &=3D ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
=20
 		extra_info->pool_ns =3D old_ns;
=20
@@ -4815,7 +4815,7 @@ int ceph_drop_caps_for_unlink(struct inode *inode)
 			doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode,
 			      ceph_vinop(inode));
 			spin_lock(&mdsc->cap_delay_lock);
-			ci->i_ceph_flags |=3D CEPH_I_FLUSH;
+			set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 			if (!list_empty(&ci->i_cap_delay_list))
 				list_del_init(&ci->i_cap_delay_list);
 			list_add_tail(&ci->i_cap_delay_list,
@@ -5080,7 +5080,7 @@ int ceph_purge_inode_cap(struct inode *inode, struct =
ceph_cap *cap, bool *invali
=20
 		if (atomic_read(&ci->i_filelock_ref) > 0) {
 			/* make further file lock syscall return -EIO */
-			ci->i_ceph_flags |=3D CEPH_I_ERROR_FILELOCK;
+			set_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 			pr_warn_ratelimited_client(cl,
 				" dropping file locks for %p %llx.%llx\n",
 				inode, ceph_vinop(inode));
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5e7c73a29aa3..2b457dab0837 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -579,12 +579,11 @@ static void wake_async_create_waiters(struct inode *i=
node,
=20
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		clear_and_wake_up_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
+		clear_and_wake_up_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
=20
-		if (ci->i_ceph_flags & CEPH_I_ASYNC_CHECK_CAPS) {
-			ci->i_ceph_flags &=3D ~CEPH_I_ASYNC_CHECK_CAPS;
+		if (test_and_clear_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT,
+				      &ci->i_ceph_flags))
 			check_cap =3D true;
-		}
 	}
 	ceph_kick_flushing_inode_caps(session, ci);
 	spin_unlock(&ci->i_ceph_lock);
@@ -747,7 +746,8 @@ static int ceph_finish_async_create(struct inode *dir, =
struct inode *inode,
 			 * that point and don't worry about setting
 			 * CEPH_I_ASYNC_CREATE.
 			 */
-			ceph_inode(inode)->i_ceph_flags =3D CEPH_I_ASYNC_CREATE;
+			set_bit(CEPH_I_ASYNC_CREATE_BIT,
+				&ceph_inode(inode)->i_ceph_flags);
 			unlock_new_inode(inode);
 		}
 		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
@@ -2422,7 +2422,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, st=
ruct iov_iter *from)
=20
 	if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) =3D=3D 0 ||
 	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
-	    (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+	    test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
 		struct ceph_snap_context *snapc;
 		struct iov_iter data;
=20
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index d99e12d1100b..f75d66760d54 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1142,7 +1142,7 @@ int ceph_fill_inode(struct inode *inode, struct page =
*locked_page,
 		rcu_assign_pointer(ci->i_layout.pool_ns, pool_ns);
=20
 		if (ci->i_layout.pool_id !=3D old_pool || pool_ns !=3D old_ns)
-			ci->i_ceph_flags &=3D ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
=20
 		pool_ns =3D old_ns;
=20
@@ -3199,7 +3199,7 @@ void ceph_inode_shutdown(struct inode *inode)
 	bool invalidate =3D false;
=20
 	spin_lock(&ci->i_ceph_lock);
-	ci->i_ceph_flags |=3D CEPH_I_SHUTDOWN;
+	set_bit(CEPH_I_SHUTDOWN_BIT, &ci->i_ceph_flags);
 	p =3D rb_first(&ci->i_caps);
 	while (p) {
 		struct ceph_cap *cap =3D rb_entry(p, struct ceph_cap, ci_node);
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index dd764f9c64b9..c4ff2266bb94 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -57,9 +57,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
 	ci =3D ceph_inode(inode);
 	if (atomic_dec_and_test(&ci->i_filelock_ref)) {
 		/* clear error when all locks are released */
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &=3D ~CEPH_I_ERROR_FILELOCK;
-		spin_unlock(&ci->i_ceph_lock);
+		clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 	}
 	fl->fl_u.ceph.inode =3D NULL;
 	iput(inode);
@@ -271,15 +269,10 @@ int ceph_lock(struct file *file, int cmd, struct file=
_lock *fl)
 	else if (IS_SETLKW(cmd))
 		wait =3D 1;
=20
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err =3D -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (op =3D=3D CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
 			posix_lock_file(file, fl, NULL);
-		return err;
+		return -EIO;
 	}
=20
 	if (lock_is_read(fl))
@@ -331,15 +324,10 @@ int ceph_flock(struct file *file, int cmd, struct fil=
e_lock *fl)
=20
 	doutc(cl, "fl_file: %p\n", fl->c.flc_file);
=20
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err =3D -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (lock_is_unlock(fl))
 			locks_lock_file_wait(file, fl);
-		return err;
+		return -EIO;
 	}
=20
 	if (IS_SETLKW(cmd))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..ccf0d53dde2b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3613,7 +3613,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
=20
 		spin_lock(&ci->i_ceph_lock);
 		cap =3D ci->i_auth_cap;
-		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds !=3D cap->mds) {
+		if (test_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
+		    mds !=3D cap->mds) {
 			doutc(cl, "session changed for auth cap %d -> %d\n",
 			      cap->session->s_mds, session->s_mds);
=20
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0428a5eaf28c..e91a199d56fd 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -658,7 +658,7 @@ static inline int ceph_wait_on_async_create(struct inod=
e *inode)
 {
 	struct ceph_inode_info *ci =3D ceph_inode(inode);
=20
-	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_I_ASYNC_CREATE_BIT,
 			   TASK_KILLABLE);
 }
=20
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 52b4c2684f92..9b79a5eaca93 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -700,7 +700,7 @@ int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
 		return 0;
 	}
=20
-	ci->i_ceph_flags |=3D CEPH_I_FLUSH_SNAPS;
+	set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 	doutc(cl, "%p %llx.%llx cap_snap %p snapc %p %llu %s s=3D%llu\n",
 	      inode, ceph_vinop(inode), capsnap, capsnap->context,
 	      capsnap->context->seq, ceph_cap_string(capsnap->dirty),
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 29a980e22dc2..c89ad8dcc969 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -655,23 +655,32 @@ static inline struct inode *ceph_find_inode(struct su=
per_block *sb,
 /*
  * Ceph inode.
  */
-#define CEPH_I_DIR_ORDERED	(1 << 0)  /* dentries in dir are ordered */
-#define CEPH_I_FLUSH		(1 << 2)  /* do not delay flush of dirty metadata */
-#define CEPH_I_POOL_PERM	(1 << 3)  /* pool rd/wr bits are valid */
-#define CEPH_I_POOL_RD		(1 << 4)  /* can read from pool */
-#define CEPH_I_POOL_WR		(1 << 5)  /* can write to pool */
-#define CEPH_I_SEC_INITED	(1 << 6)  /* security initialized */
-#define CEPH_I_KICK_FLUSH	(1 << 7)  /* kick flushing caps */
-#define CEPH_I_FLUSH_SNAPS	(1 << 8)  /* need flush snapss */
-#define CEPH_I_ERROR_WRITE	(1 << 9) /* have seen write errors */
-#define CEPH_I_ERROR_FILELOCK	(1 << 10) /* have seen file lock errors */
-#define CEPH_I_ODIRECT_BIT	(11) /* inode in direct I/O mode */
-#define CEPH_I_ODIRECT		(1 << CEPH_I_ODIRECT_BIT)
-#define CEPH_ASYNC_CREATE_BIT	(12)	  /* async create in flight for this */
-#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
-#define CEPH_I_SHUTDOWN		(1 << 13) /* inode is no longer usable */
-#define CEPH_I_ASYNC_CHECK_CAPS	(1 << 14) /* check caps immediately after =
async
-					     creating finishes */
+#define CEPH_I_DIR_ORDERED_BIT		(0)  /* dentries in dir are ordered */
+					     /* bit 1 historically unused */
+#define CEPH_I_FLUSH_BIT		(2)  /* do not delay flush of dirty metadata */
+#define CEPH_I_POOL_PERM_BIT		(3)  /* pool rd/wr bits are valid */
+#define CEPH_I_POOL_RD_BIT		(4)  /* can read from pool */
+#define CEPH_I_POOL_WR_BIT		(5)  /* can write to pool */
+#define CEPH_I_SEC_INITED_BIT		(6)  /* security initialized */
+#define CEPH_I_KICK_FLUSH_BIT		(7)  /* kick flushing caps */
+#define CEPH_I_FLUSH_SNAPS_BIT		(8)  /* need flush snaps */
+#define CEPH_I_ERROR_WRITE_BIT		(9)  /* have seen write errors */
+#define CEPH_I_ERROR_FILELOCK_BIT	(10) /* have seen file lock errors */
+#define CEPH_I_ODIRECT_BIT		(11) /* inode in direct I/O mode */
+#define CEPH_I_ASYNC_CREATE_BIT		(12) /* async create in flight for this */
+#define CEPH_I_SHUTDOWN_BIT		(13) /* inode is no longer usable */
+#define CEPH_I_ASYNC_CHECK_CAPS_BIT	(14) /* check caps after async creatin=
g finishes */
+
+#define CEPH_I_DIR_ORDERED		(1 << CEPH_I_DIR_ORDERED_BIT)
+#define CEPH_I_FLUSH			(1 << CEPH_I_FLUSH_BIT)
+#define CEPH_I_POOL_PERM		(1 << CEPH_I_POOL_PERM_BIT)
+#define CEPH_I_POOL_RD			(1 << CEPH_I_POOL_RD_BIT)
+#define CEPH_I_POOL_WR			(1 << CEPH_I_POOL_WR_BIT)
+#define CEPH_I_SEC_INITED		(1 << CEPH_I_SEC_INITED_BIT)
+#define CEPH_I_KICK_FLUSH		(1 << CEPH_I_KICK_FLUSH_BIT)
+#define CEPH_I_FLUSH_SNAPS		(1 << CEPH_I_FLUSH_SNAPS_BIT)
+#define CEPH_I_ODIRECT			(1 << CEPH_I_ODIRECT_BIT)
+#define CEPH_I_ASYNC_CREATE		(1 << CEPH_I_ASYNC_CREATE_BIT)
=20
 /*
  * Masks of ceph inode work.
@@ -684,27 +693,18 @@ static inline struct inode *ceph_find_inode(struct su=
per_block *sb,
=20
 /*
  * We set the ERROR_WRITE bit when we start seeing write errors on an inode
- * and then clear it when they start succeeding. Note that we do a lockless
- * check first, and only take the lock if it looks like it needs to be cha=
nged.
- * The write submission code just takes this as a hint, so we're not too
- * worried if a few slip through in either direction.
+ * and then clear it when they start succeeding. The write submission code
+ * just takes this as a hint, so we're not too worried if a few slip throu=
gh
+ * in either direction.
  */
 static inline void ceph_set_error_write(struct ceph_inode_info *ci)
 {
-	if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags |=3D CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
=20
 static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
 {
-	if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &=3D ~CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
=20
 static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
@@ -1142,7 +1142,7 @@ static inline bool ceph_inode_is_shutdown(struct inod=
e *inode)
 	struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode);
 	int state =3D READ_ONCE(fsc->mount_state);
=20
-	return (flags & CEPH_I_SHUTDOWN) || state >=3D CEPH_MOUNT_SHUTDOWN;
+	return (flags & BIT(CEPH_I_SHUTDOWN_BIT)) || state >=3D CEPH_MOUNT_SHUTDO=
WN;
 }
=20
 /* xattr.c */
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 5f87f62091a1..7cf9e908c2fe 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1054,7 +1054,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const ch=
ar *name, void *value,
 	if (current->journal_info &&
 	    !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) &&
 	    security_ismaclabel(name + XATTR_SECURITY_PREFIX_LEN))
-		ci->i_ceph_flags |=3D CEPH_I_SEC_INITED;
+		set_bit(CEPH_I_SEC_INITED_BIT, &ci->i_ceph_flags);
 out:
 	spin_unlock(&ci->i_ceph_lock);
 	return err;
--=20
2.34.1
From nobody Sat Jun 20 11:50:30 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id A2BC43E3D93
	for <linux-kernel@vger.kernel.org>; Wed, 15 Apr 2026 17:01:26 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1776272489; cv=none;
 b=bZYJULLhi3ztkPnMCnBAwlxgrClOSKn2hSsg0DPbFpLQAnwHhSE86IHKYfvlUeBYMpljaUPz9iYZI9CUAgHW0YRocPB9G4SHQ95EuSxXHL2+CHrnw2PawDl9WGESxB2ZRIBYy/pgStMlr6W7HulKhaDUmLgdvIG37+rUOptUTXU=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1776272489; c=relaxed/simple;
	bh=nKbqGY84rIJB8BOdLzFBm2ltPcQM/ROZhxewUnBDS9I=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=U3RzNKLe3oJiUWjTo0++gSf7KHc5uyjg6+kNC/gNLTehsr/QEID02X4qTnBIpBobZn4+J7LztvqwayvsKFX2lzF+Ba2ZAfg2VBdN7ib3egDrJLVlbLWgq2xMH9vndt71UqXtXpBi23Ar6oKs/w2slC01TETbGDuv89La0AP5EjE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=NtDYC7+y;
 dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=nSmqfcM0; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="NtDYC7+y";
	dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="nSmqfcM0"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1776272485;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=6Nj+i/zUf6yIYxwBMDoj27NRKQ28v/DcLzCnlkUqOk8=;
	b=NtDYC7+yejGz1eAGaQz89VYzXVESMEfQVNHK8pdgMcpEU84fR+n0uTvbvkbQKug8LvIc8u
	JkBvcvGdrnmpzlWJa6Vyo5vlS4TsEKPiN8X8cDUnCW+fKQMvRoJqUKCvYGXmVaav72JOd5
	VLjq7FgaL/rb2MdbUeDy8ZxInHOXUcU=
Received: from mail-qt1-f198.google.com (mail-qt1-f198.google.com
 [209.85.160.198]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-146-B8K31ByoPNKk0w8QMnZGsw-1; Wed, 15 Apr 2026 13:01:23 -0400
X-MC-Unique: B8K31ByoPNKk0w8QMnZGsw-1
X-Mimecast-MFC-AGG-ID: B8K31ByoPNKk0w8QMnZGsw_1776272483
Received: by mail-qt1-f198.google.com with SMTP id
 d75a77b69052e-50d63962d83so167591911cf.2
        for <linux-kernel@vger.kernel.org>;
 Wed, 15 Apr 2026 10:01:23 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=redhat.com; s=google; t=1776272483; x=1776877283;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=6Nj+i/zUf6yIYxwBMDoj27NRKQ28v/DcLzCnlkUqOk8=;
        b=nSmqfcM04Cx16DM6QJrk98i5mkVXdtlWp/ZA2jlBFi0q9ZrLo78oeHLAyn8gkZldJm
         eVdFrHwETW2dnl4/CO1CypDO5RlxlvSFu/z+WuMC7mK83w7wrLh5V/nOZBFa6o6wcc/v
         mtoYKaIrVB5+R/1n+UR89Uzf1fXZOhbDpcTG2nWVnimoGKpcaZ/m7yTSpbsuQ5FiBzOq
         /uZj0v2ivD/1w6UqhAIRkdHGdqZQC64eSDdh78Z1gmoZBRt/i5mVErzrqqYkbq+eRgy+
         aovLGR7mj6LVxTo5DNYns+Oea2CT4CStXr17lG7OxtWJxzAco/3vGkNE+3NxdkOanVzZ
         oJNw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1776272483; x=1776877283;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
         :to:cc:subject:date:message-id:reply-to;
        bh=6Nj+i/zUf6yIYxwBMDoj27NRKQ28v/DcLzCnlkUqOk8=;
        b=F0L1pgKh1MoWdbgoyCUV8pGFB5ESkRcjqdy9cZ1+/mHbJ7sVwALi20mOnFV2AFTih9
         GRUAEI5VzqCnNkQy2PQ3lQIw9W1ufU27lqelOsnzoL8B/iKMoGw6VenYPuXJX+9PJKME
         o7XAmB8n7hFodGdB+9SaGACoIm84uVjIyKDQ2pepfegGmFGy9YfAEzA12tOtGerclBKz
         v/fGDyTaOOMhsSZ3dLlUx5vUiaC+qYOEZlf9i02DiRVvumlZA4h1YCAonhksg5h6HmbK
         JC8pvzDzjI6ZRdLq3VqxA+qP2t/+VrHWOZm9Fat/xMJLI412Cet0QjfAbO+Dq11lhum2
         eFqA==
X-Gm-Message-State: AOJu0YyjgLZzz0UXYyd6Y+mgDOUL6I46122opLxR8jhI36sKdLfkU275
	8wXpnqOlpzsHpFIof08UT2kENAAGRtk1vmzXgbzpwnO87QJxMb+gWQuYUkvZSE7hacDEZgE8AkU
	dLMscVN8hHdUMCD3fGVKJ+8oTdJtsbBsMjYxuzlLCY5P677oF+DcL7mqVeslM+vm7+g==
X-Gm-Gg: AeBDietzFZuYTOTYXqmIaNgeOgrhR/DWVXjmGAY/oS25MQC2SDOZGKDm3A8juKBDvjF
	R1ZRrESB4vyzXSQaljF14c0mmFiIcqxCnvAyDosp8WHL3z2WOzVi5PgmbI2qNIh6eRYiOIuPua8
	I/9vJpCT8hdzXXmevSnDF0EZRs7Kwu17ok+lrOpzQSlV17c60mwgxZlFcRkTk+Oe80JLl+aolED
	RMzVPb6HlVEiQVahp2DbsFprCE1jo4wiA+Wiogmbv5+h2FAV98Gu0fKccuf/GzSJmW6Qtmz45Os
	5efPeqt3AT0wYIoyZmvh41uVjTeyayj03E96Y2/nRSQt1K5/hvpiMooPr6ntt+n+jSwpL4EUdtO
	8XBdXratVBy8XeMT6jOvUBiGW1uVobZntXa2W25uiWPz9HDoPywV42RVHye7Q5FMHeA==
X-Received: by 2002:a05:622a:10c:b0:50d:c69c:d01c with SMTP id
 d75a77b69052e-50dd5c3d6e0mr339846751cf.37.1776272482479;
        Wed, 15 Apr 2026 10:01:22 -0700 (PDT)
X-Received: by 2002:a05:622a:10c:b0:50d:c69c:d01c with SMTP id
 d75a77b69052e-50dd5c3d6e0mr339824561cf.37.1776272464530;
        Wed, 15 Apr 2026 10:01:04 -0700 (PDT)
Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com.
 [13.121.85.79])
        by smtp.gmail.com with ESMTPSA id
 d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.03
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 15 Apr 2026 10:01:04 -0700 (PDT)
From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	idryomov@gmail.com,
	vdubeyko@redhat.com,
	Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v2 2/7] ceph: use proper endian conversion for flock_len in
 reconnect
Date: Wed, 15 Apr 2026 17:00:38 +0000
Message-Id: <20260415170043.3882912-3-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Replace the __force __le32 cast with cpu_to_le32() for the flock_len field
in reconnect_caps_cb(). The cast only silenced sparse and left the value in
host byte order; cpu_to_le32() performs the actual endian conversion.
Little-endian builds are unaffected.

Also switch from a raw bitmask test against i_ceph_flags to test_bit() on
the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor for the
unsigned long flags field after the bit-position conversion.
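
To illustrate the first change, here is a minimal userspace sketch (not
kernel code) of why a cast and a conversion are not the same thing. The
helpers host_is_big_endian(), sketch_cpu_to_le32() and first_wire_byte()
are hypothetical stand-ins written for this note; the kernel's
cpu_to_le32() is implemented differently.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a big-endian host, 0 otherwise. */
static int host_is_big_endian(void)
{
	const uint16_t probe = 0x0102;
	uint8_t first;

	memcpy(&first, &probe, 1);
	return first == 0x01;
}

/*
 * Hypothetical stand-in for cpu_to_le32(): returns a value whose
 * in-memory bytes are little-endian on any host.
 */
static uint32_t sketch_cpu_to_le32(uint32_t v)
{
	if (!host_is_big_endian())
		return v;	/* already little-endian in memory */
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8) | ((v & 0xff000000u) >> 24);
}

/* Return the first byte as it would appear on the wire. */
static uint8_t first_wire_byte(uint32_t wire)
{
	uint8_t b;

	memcpy(&b, &wire, 1);
	return b;
}
```

With the conversion, first_wire_byte(sketch_cpu_to_le32(1)) is 1 on any
host. A bare cast of 1 would put 0 in the first wire byte on a big-endian
host, because the value stays in host byte order.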

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
---
 fs/ceph/mds_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ccf0d53dde2b..871f0eef468d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4693,8 +4693,9 @@ static int reconnect_caps_cb(struct inode *inode, int=
 mds, void *arg)
 		rec.v2.issued =3D cpu_to_le32(cap->issued);
 		rec.v2.snaprealm =3D cpu_to_le64(ci->i_snap_realm->ino);
 		rec.v2.pathbase =3D cpu_to_le64(path_info.vino.ino);
-		rec.v2.flock_len =3D (__force __le32)
-			((ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) ? 0 : 1);
+		rec.v2.flock_len =3D cpu_to_le32(
+			test_bit(CEPH_I_ERROR_FILELOCK_BIT,
+				 &ci->i_ceph_flags) ? 0 : 1);
 	} else {
 		struct timespec64 ts;
=20
--=20
2.34.1
From nobody Sat Jun 20 11:50:30 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 92530346A08
	for <linux-kernel@vger.kernel.org>; Wed, 15 Apr 2026 17:06:59 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1776272821; cv=none;
 b=cUsg2E9X74ybFcQ1KmpeFxeo12dUVONTznroTpx+WPAz3tL+me7ZYuoq9af1mFYL4XgCVJ5uhTtC2hxrJrI7T0mz5doVQKUCMmTTDJiaf1oMM7veb/P7RmcepEoktojW4PjVAI5sQ25aFHkmJKsJH5mfeHU/HlVLTAUNy4lXBiY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1776272821; c=relaxed/simple;
	bh=xRN7giwH8NEcKW9yTVpwkdPtpHGQFSSZf19t6Lm+GmU=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=TXH7C5z0gzgXmE7aPUlYcAcOuFLTjtxAh/NyPXoUBnp9lRRHb5HPMzPgCTHL08s/3SFlWP8i8AscuO7o6C96i+SsgIdhKnYIoguXHoTndQlwB4qQVZepWFNbv/PjNDrGNhh5O/djMVLeFE80Ij123up0O8CzCR8nNsegt80pUIU=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=eVIMcvpq;
 dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=stxscCVa; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="eVIMcvpq";
	dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="stxscCVa"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1776272818;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=AT7s8MIORBy5Iq5wFLBR87C37LfOCNHX9KOc+RZPc4Y=;
	b=eVIMcvpqBhI8Es8Ar6G5Nh54DDYbGEEpwAnyZO0rDEg1nAyqFtJrwk79Ud6G796QVnBXHU
	BKPq7pHAZg7zBCcTv1cpjfJ291jiz9KqZZkUbqXydQ+Bj/3rzZKLHbGhNOYc5V9Xj9pUxS
	4t3uLm5QIePxDTySgf+gzLHYLnpYzVw=
Received: from mail-pl1-f197.google.com (mail-pl1-f197.google.com
 [209.85.214.197]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-137-fpDOANljPZ-HaCP3NtH6pw-1; Wed, 15 Apr 2026 13:06:57 -0400
X-MC-Unique: fpDOANljPZ-HaCP3NtH6pw-1
X-Mimecast-MFC-AGG-ID: fpDOANljPZ-HaCP3NtH6pw_1776272816
Received: by mail-pl1-f197.google.com with SMTP id
 d9443c01a7336-2b242062308so129539635ad.2
        for <linux-kernel@vger.kernel.org>;
 Wed, 15 Apr 2026 10:06:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=redhat.com; s=google; t=1776272816; x=1776877616;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=AT7s8MIORBy5Iq5wFLBR87C37LfOCNHX9KOc+RZPc4Y=;
        b=stxscCVak3JxpecO9DJ73BxhiJ+OMGPmZa8dEcr9ZLg4UwNkEL1G9Eo3JZ3FP42KXC
         HQU2Mc9rkB3Ds/IHJP4q/OIYwTx2ygVXt/KgG4GCIB+CR+5JiK2ALJQlN3ZiM3cF/FVz
         xd/MP6Uk6i89/cMjx2BzVk4TwjgxJfk3eugn2zdZFO5Sum5HuxHqkhnHh+zWGf2mRmfx
         GnO9YD1dQuYmWgYjR/6bb2bsD7Dh/r5sGifTEaFu6Oz8a5PpdigdYvUZRVGyilSE5ALn
         8QQRt3uSYKotCGpmPlRXuVi0QCko3SIF7xhpPLCa24eF9qRBxRdN7mL2GDekGKdVD2cr
         4lNw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1776272816; x=1776877616;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
         :to:cc:subject:date:message-id:reply-to;
        bh=AT7s8MIORBy5Iq5wFLBR87C37LfOCNHX9KOc+RZPc4Y=;
        b=US2pzbIXErQtqyTG9WeZOs5hKMrbgJdvTxn7vLvKEgZhDRwzRep2QYxpxcKSZPhbC7
         4NJaOVoZ3Xv3BFP6HKcokaGgWfmP3VT6QBUVz1YAYxclD4QqFudcc5sThfMwlfzi0h5q
         ILQG5Wb2Tki0ISCYQW7CsUtJWQopu2VgOp8iGGxIh41FB5KI+4FEAIviPYRHSaZ9COsD
         ZQJu9TJvIlTTJvDTcPJXH1lCtVOZ0rtdH1Au6u78njanUu+pJHGeSlIkrJmpHzzDNLQt
         e4J68X+6ILwp4H1mRREowqgpHyXFYoF4TFQXcqH6mFQvJ7r7Q64iMxDegyhnw6GvUSTk
         VNQg==
X-Gm-Message-State: AOJu0YwBfHxvRFg5ITfU+hYSz+XwbATHPNJ35CxQtI+PkwSXeZkXy9U3
	dAYOnXxdzdZVuXS0tIqinyia/cyMz6Wtqi4xwtet1woXpFf7c6+aNccEiyoDfDK9o2JivbdOMKP
	Thrh3lM+O7y4coEZ2c1nppa9uu2CgyFsSGcGSFz0FJvstUJftT4OKWXwZ8L5hpS0hMZ9DCtJajT
	Br
X-Gm-Gg: AeBDiet7uPvb2rvFpiKf47GvW1T/nUtj4RcanKkdOhU0dHQNf+1Pmz6fvqc6FaUQM6M
	/cF2qk+R6yo3pxI0BcqfwFq48cdVgRem8g9BOW4xfZxuwaJ73iiRkoiSbt1vnqJXNJScERXFtPe
	24UMTjnSwjdU6Jlfl2nCmuJWq6V/zCGLAjpiEa9sESqOopU5fYMHLLjvB52qz6Ku4oC9Jz9O7fj
	BsKynGxyD1pgvDcvFdGZdWg36ufWh5kquC2gD3xmkp0LFV2gQafVdX81fbJUNNcC5ZTxx046XzW
	5J142HLkjrLfk5w1Tyxpe7IkrAhJyrk0IbIIUNAkd8dEuE4w8c3SS2AlruII+yMdBTXEzcYaooT
	3/1pbl2Ec5gjpAIeIMtjmSV3DVDBYfANsj/a/YfHmFWOSYjdkQKmFMDKbBK3CLdIdYQ==
X-Received: by 2002:a17:903:2290:b0:2b4:59d4:9a with SMTP id
 d9443c01a7336-2b459d403abmr150753385ad.2.1776272815618;
        Wed, 15 Apr 2026 10:06:55 -0700 (PDT)
X-Received: by 2002:a05:622a:5c0d:b0:50d:7304:1770 with SMTP id
 d75a77b69052e-50dd5a83830mr316724691cf.8.1776272467092;
        Wed, 15 Apr 2026 10:01:07 -0700 (PDT)
Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com.
 [13.121.85.79])
        by smtp.gmail.com with ESMTPSA id
 d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.05
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 15 Apr 2026 10:01:06 -0700 (PDT)
From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	idryomov@gmail.com,
	vdubeyko@redhat.com,
	Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v2 3/7] ceph: harden send_mds_reconnect and handle active-MDS
 peer reset
Date: Wed, 15 Apr 2026 17:00:39 +0000
Message-Id: <20260415170043.3882912-4-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Change send_mds_reconnect() to return an error code so callers can detect
and report reconnect failures instead of silently ignoring them. Add early
bailout checks for sessions that are already closed, rejected, or
unregistered, which avoids sending reconnect messages for sessions that
can no longer be recovered.

The early -ESTALE and -ENOENT bailouts use a separate fail_return label
that skips the pr_err_client diagnostic, since these codes indicate
expected concurrent-teardown races rather than genuine reconnect build
failures.
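
The two exit paths can be pictured with a small userspace sketch. It is an
illustration only: sketch_send() and the sketch_err_logs counter are
hypothetical names, and the counter stands in for the pr_err_client()
call.

```c
#include <errno.h>

/* Hypothetical counter standing in for pr_err_client() output. */
static int sketch_err_logs;

/*
 * closed/unregistered model the early session checks; build_rc models
 * the result of building the reconnect message (0 on success).
 */
static int sketch_send(int closed, int unregistered, int build_rc)
{
	int err;

	if (closed) {
		err = -ESTALE;
		goto fail_return;
	}
	if (unregistered) {
		err = -ENOENT;
		goto fail_return;
	}

	err = build_rc;
	if (err)
		goto fail;
	return 0;

fail:
	/* Genuine build failure: emit the error diagnostic. */
	sketch_err_logs++;
	return err;

fail_return:
	/* Expected teardown race: return quietly. */
	return err;
}
```

Only the build-failure path bumps the counter; the early bailouts return
their code without the error-level message.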

Save the prior session state before transitioning to RECONNECTING,
and restore it in the failure path.  Without this, a transient
build or encoding failure (-ENOMEM, -ENOSPC) strands the session
in RECONNECTING indefinitely because check_new_map() only retries
sessions in RESTARTING state.
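
A minimal sketch of the save-and-restore idea, assuming a toy session
type. The enum, struct and function names below are hypothetical and exist
only for this illustration; locking is omitted.

```c
/* Hypothetical, reduced set of session states. */
enum sketch_state { SK_RESTARTING, SK_RECONNECTING, SK_OPEN };

struct sketch_session {
	enum sketch_state state;
};

/*
 * build_rc models the outcome of building the reconnect message:
 * 0 on success, a negative errno-style value on failure.
 */
static int sketch_reconnect(struct sketch_session *s, int build_rc)
{
	enum sketch_state old_state = s->state;

	s->state = SK_RECONNECTING;
	if (build_rc) {
		/* Roll back so a retry keyed on the old state can fire. */
		s->state = old_state;
		return build_rc;
	}
	return 0;
}
```

After a failed call the session is back in its prior state, so a retry
that only looks at RESTARTING sessions still sees it.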

Rewrite mds_peer_reset() to handle the case where the MDS is past its
RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
messages because it only accepts them during its own RECONNECT window
after restart. Previously, the client would send a doomed reconnect
that the MDS would reject or ignore. Now, the client tears the session
down locally and lets new requests re-open a fresh session, which is
the correct recovery for this scenario. The RECONNECTING state is
handled on the same teardown path, since an active MDS will reject
reconnect attempts regardless of the session's local state.
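
The resulting three-way decision can be sketched as a pure function. The
names and numeric values below are hypothetical; only the ordering of the
MDS states (before RECONNECT, at RECONNECT, past RECONNECT) matters for
the illustration.

```c
/* Hypothetical MDS states; only their relative order is meaningful. */
enum sketch_mds_state {
	SK_MDS_REPLAY = 1,
	SK_MDS_RECONNECT = 2,
	SK_MDS_ACTIVE = 3
};

enum sketch_action { SK_IGNORE, SK_SEND_RECONNECT, SK_TEARDOWN };

/* fenced models the mount being in the FENCE_IO state. */
static enum sketch_action
sketch_peer_reset_action(int fenced, int mds_state)
{
	if (fenced || mds_state < SK_MDS_RECONNECT)
		return SK_IGNORE;
	if (mds_state == SK_MDS_RECONNECT)
		return SK_SEND_RECONNECT;
	/* Past RECONNECT: a reconnect would be rejected. */
	return SK_TEARDOWN;
}
```

Only an MDS that is exactly in its RECONNECT window gets a reconnect
message; anything later takes the local teardown path.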

The session teardown path in mds_peer_reset() follows the established
drop-and-reacquire locking pattern from check_new_map(): take
mdsc->mutex for session unregistration, release it, then take s->s_mutex
separately for cleanup. This avoids introducing a new simultaneous lock
nesting pattern.

Log reconnect failures from check_new_map() and mds_peer_reset() at
pr_warn level rather than pr_err, since return codes like -ESTALE
(closed/rejected session) and -ENOENT (unregistered session) are
expected during concurrent teardown. Log dropped messages for
unregistered sessions via doutc() (dynamic debug) rather than
pr_info, as post-reset message arrival is routine and does not
warrant unconditional logging.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
 fs/ceph/mds_client.c | 163 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 151 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 871f0eef468d..b14ede808436 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *=
session,
 		break;
=20
 	case CEPH_SESSION_REJECT:
-		WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING);
-		pr_info_client(cl, "mds%d rejected session\n",
-			       session->s_mds);
+		WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING &&
+			session->s_state !=3D CEPH_MDS_SESSION_RECONNECTING);
+		if (session->s_state =3D=3D CEPH_MDS_SESSION_RECONNECTING)
+			pr_info_client(cl, "mds%d reconnect rejected\n",
+				       session->s_mds);
+		else
+			pr_info_client(cl, "mds%d rejected session\n",
+				       session->s_mds);
 		session->s_state =3D CEPH_MDS_SESSION_REJECTED;
 		cleanup_session_requests(mdsc, session);
 		remove_session_caps(session);
@@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, in=
t mds, void *arg)
 	cap->mseq =3D 0;       /* and migrate_seq */
 	cap->cap_gen =3D atomic_read(&cap->session->s_cap_gen);
=20
+	/*
+	 * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect.
+	 * Instead, locks are submitted for best-effort MDS reclaim
+	 * via the flock_len field below.  If reclaim fails (e.g.,
+	 * another client grabbed a conflicting lock), future lock
+	 * operations will fail and set the error flag at that point.
+	 */
+
 	/* These are lost when the session goes away */
 	if (S_ISDIR(inode->i_mode)) {
 		if (cap->issued & CEPH_CAP_DIR_CREATE) {
@@ -4892,13 +4905,14 @@ static int encode_snap_realms(struct ceph_mds_clien=
t *mdsc,
  *
  * This is a relatively heavyweight operation, but it's rare.
  */
-static void send_mds_reconnect(struct ceph_mds_client *mdsc,
-			       struct ceph_mds_session *session)
+static int send_mds_reconnect(struct ceph_mds_client *mdsc,
+			      struct ceph_mds_session *session)
 {
 	struct ceph_client *cl =3D mdsc->fsc->client;
 	struct ceph_msg *reply;
 	int mds =3D session->s_mds;
 	int err =3D -ENOMEM;
+	int old_state;
 	struct ceph_reconnect_state recon_state =3D {
 		.session =3D session,
 	};
@@ -4917,6 +4931,31 @@ static void send_mds_reconnect(struct ceph_mds_clien=
t *mdsc,
 	xa_destroy(&session->s_delegated_inos);
=20
 	mutex_lock(&session->s_mutex);
+	if (session->s_state =3D=3D CEPH_MDS_SESSION_CLOSED ||
+	    session->s_state =3D=3D CEPH_MDS_SESSION_REJECTED) {
+		pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
+			       mds,
+			       ceph_session_state_name(session->s_state));
+		mutex_unlock(&session->s_mutex);
+		ceph_msg_put(reply);
+		err =3D -ESTALE;
+		goto fail_return;
+	}
+
+	mutex_lock(&mdsc->mutex);
+	if (mds >=3D mdsc->max_sessions || mdsc->sessions[mds] !=3D session) {
+		mutex_unlock(&mdsc->mutex);
+		pr_info_client(cl,
+			       "mds%d skipping reconnect, session unregistered\n",
+			       mds);
+		mutex_unlock(&session->s_mutex);
+		ceph_msg_put(reply);
+		err =3D -ENOENT;
+		goto fail_return;
+	}
+	mutex_unlock(&mdsc->mutex);
+
+	old_state =3D session->s_state;
 	session->s_state =3D CEPH_MDS_SESSION_RECONNECTING;
 	session->s_seq =3D 0;
=20
@@ -5046,18 +5085,34 @@ static void send_mds_reconnect(struct ceph_mds_clie=
nt *mdsc,
=20
 	up_read(&mdsc->snap_rwsem);
 	ceph_pagelist_release(recon_state.pagelist);
-	return;
+	return 0;
=20
 fail:
 	ceph_msg_put(reply);
 	up_read(&mdsc->snap_rwsem);
+	/*
+	 * Restore prior session state so map-driven reconnect logic
+	 * (check_new_map) can retry.  Without this, a transient build
+	 * failure strands the session in RECONNECTING indefinitely.
+	 */
+	session->s_state =3D old_state;
 	mutex_unlock(&session->s_mutex);
 fail_nomsg:
 	ceph_pagelist_release(recon_state.pagelist);
 fail_nopagelist:
 	pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
 		      err, mds);
-	return;
+	return err;
+
+fail_return:
+	/*
+	 * Early-exit path for expected concurrent-teardown races
+	 * (-ESTALE for closed/rejected sessions, -ENOENT for
+	 * unregistered sessions).  Skip the pr_err_client diagnostic
+	 * since these are not genuine reconnect build failures.
+	 */
+	ceph_pagelist_release(recon_state.pagelist);
+	return err;
 }
=20
=20
@@ -5138,9 +5193,15 @@ static void check_new_map(struct ceph_mds_client *md=
sc,
 		 */
 		if (s->s_state =3D=3D CEPH_MDS_SESSION_RESTARTING &&
 		    newstate >=3D CEPH_MDS_STATE_RECONNECT) {
+			int rc;
+
 			mutex_unlock(&mdsc->mutex);
 			clear_bit(i, targets);
-			send_mds_reconnect(mdsc, s);
+			rc =3D send_mds_reconnect(mdsc, s);
+			if (rc)
+				pr_warn_client(cl,
+					       "mds%d reconnect failed: %d\n",
+					       i, rc);
 			mutex_lock(&mdsc->mutex);
 		}
=20
@@ -5204,7 +5265,11 @@ static void check_new_map(struct ceph_mds_client *md=
sc,
 		}
 		doutc(cl, "send reconnect to export target mds.%d\n", i);
 		mutex_unlock(&mdsc->mutex);
-		send_mds_reconnect(mdsc, s);
+		err =3D send_mds_reconnect(mdsc, s);
+		if (err)
+			pr_warn_client(cl,
+				       "mds%d export target reconnect failed: %d\n",
+				       i, err);
 		ceph_put_mds_session(s);
 		mutex_lock(&mdsc->mutex);
 	}
@@ -6284,12 +6349,84 @@ static void mds_peer_reset(struct ceph_connection *=
con)
 {
 	struct ceph_mds_session *s =3D con->private;
 	struct ceph_mds_client *mdsc =3D s->s_mdsc;
+	int session_state;
=20
 	pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
 		       s->s_mds);
-	if (READ_ONCE(mdsc->fsc->mount_state) !=3D CEPH_MOUNT_FENCE_IO &&
-	    ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >=3D CEPH_MDS_STATE_REC=
ONNECT)
-		send_mds_reconnect(mdsc, s);
+
+	if (READ_ONCE(mdsc->fsc->mount_state) =3D=3D CEPH_MOUNT_FENCE_IO ||
+	    ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONN=
ECT)
+		return;
+
+	if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) =3D=3D CEPH_MDS_STATE_R=
ECONNECT) {
+		int rc =3D send_mds_reconnect(mdsc, s);
+
+		if (rc)
+			pr_warn_client(mdsc->fsc->client,
+				       "mds%d reconnect failed: %d\n",
+				       s->s_mds, rc);
+		return;
+	}
+
+	/*
+	 * MDS is active (past RECONNECT).  It will not accept a
+	 * CLIENT_RECONNECT from us, so tear the session down locally
+	 * and let new requests re-open a fresh session.
+	 *
+	 * Snapshot session state with READ_ONCE, then revalidate under
+	 * mdsc->mutex before acting.  The subsequent mdsc->mutex
+	 * section rechecks s_state to catch concurrent transitions, so
+	 * the lockless snapshot here is safe.  s->s_mutex is taken
+	 * separately for cleanup after unregistration, which avoids
+	 * introducing a new s->s_mutex + mdsc->mutex nesting.
+	 */
+	session_state =3D READ_ONCE(s->s_state);
+
+	switch (session_state) {
+	case CEPH_MDS_SESSION_RESTARTING:
+	case CEPH_MDS_SESSION_RECONNECTING:
+	case CEPH_MDS_SESSION_CLOSING:
+	case CEPH_MDS_SESSION_OPEN:
+	case CEPH_MDS_SESSION_HUNG:
+	case CEPH_MDS_SESSION_OPENING:
+		mutex_lock(&mdsc->mutex);
+		if (s->s_mds >=3D mdsc->max_sessions ||
+		    mdsc->sessions[s->s_mds] !=3D s ||
+		    s->s_state !=3D session_state) {
+			pr_info_client(mdsc->fsc->client,
+				       "mds%d state changed to %s during peer reset\n",
+				       s->s_mds,
+				       ceph_session_state_name(s->s_state));
+			mutex_unlock(&mdsc->mutex);
+			return;
+		}
+
+		ceph_get_mds_session(s);
+		s->s_state =3D CEPH_MDS_SESSION_CLOSED;
+		__unregister_session(mdsc, s);
+		__wake_requests(mdsc, &s->s_waiting);
+		mutex_unlock(&mdsc->mutex);
+
+		mutex_lock(&s->s_mutex);
+		cleanup_session_requests(mdsc, s);
+		remove_session_caps(s);
+		mutex_unlock(&s->s_mutex);
+
+		wake_up_all(&mdsc->session_close_wq);
+
+		mutex_lock(&mdsc->mutex);
+		kick_requests(mdsc, s->s_mds);
+		mutex_unlock(&mdsc->mutex);
+
+		ceph_put_mds_session(s);
+		break;
+	default:
+		pr_warn_client(mdsc->fsc->client,
+			       "mds%d peer reset in unexpected state %s\n",
+			       s->s_mds,
+			       ceph_session_state_name(session_state));
+		break;
+	}
 }
=20
 static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
@@ -6301,6 +6438,8 @@ static void mds_dispatch(struct ceph_connection *con,=
 struct ceph_msg *msg)
=20
 	mutex_lock(&mdsc->mutex);
 	if (__verify_registered_session(mdsc, s) < 0) {
+		doutc(cl, "dropping tid %llu from unregistered session %d\n",
+		      le64_to_cpu(msg->hdr.tid), s->s_mds);
 		mutex_unlock(&mdsc->mutex);
 		goto out;
 	}
--=20
2.34.1
From nobody Sat Jun 20 11:50:30 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 708FB3E51ED
	for <linux-kernel@vger.kernel.org>; Wed, 15 Apr 2026 17:01:15 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1776272477; cv=none;
 b=nnkv4uCHMSmqjWUGszQ6BqrabWOvoso87Se0BzxE20qolp0Xos0p7d0R36innczLviYUhQLQbbbUiek0BVvHiaTtHE0FA5jpIscQQVIbHFR9z/g2pHv8LSCpWUxl2yGKPv0AHwHUvbMZr/eMcqJp+u1w+CJ0S353qgO1xen3ZhQ=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1776272477; c=relaxed/simple;
	bh=+CTSEsp/2olU8L6a/rv052YQpe4i7E93BUYU+16kL6E=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=LEL0/qrMjj5aMOpxKjOSQN2jtNSuYFmNbt51uEZDHcTAbRGfiu2rcqwJe6q8zCpjZidOimDsZnTYYvx/FN/pJbjXhxiwO/Z/c/RvDZZzL5scvH6Q/fvFAXsGLJAAUDu+CydJl8XduxMrIKb7iYZFVyiTo0owMlUhy8u/a3nyoUs=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=TcKPk9V4;
 dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=Mq5tueLE; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="TcKPk9V4";
	dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="Mq5tueLE"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1776272474;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=jp1zrO1uuVX04cOgiBvHowBM6uu8e3oJ6YY/cYsJ/g0=;
	b=TcKPk9V47vJ/cqMfsTxqGiduzX3F8XfIe2gcTmXJpTKzBgfd62s0Dmq0WAZ5A2mgjQECol
	UmvGmw5vXHfnvQCItpUQk65AW/S5nwIxqWYKafQSrSwq1CeLzfgB8S+Z4mBCAccPW8cBn3
	6SkFILo9LjrUMLHbgRNugLDjdhD96i4=
Received: from mail-qt1-f200.google.com (mail-qt1-f200.google.com
 [209.85.160.200]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-283-jsF3jEvxOByCU_cPqj1ecQ-1; Wed, 15 Apr 2026 13:01:13 -0400
X-MC-Unique: jsF3jEvxOByCU_cPqj1ecQ-1
X-Mimecast-MFC-AGG-ID: jsF3jEvxOByCU_cPqj1ecQ_1776272472
Received: by mail-qt1-f200.google.com with SMTP id
 d75a77b69052e-50d76f460b2so182208511cf.2
        for <linux-kernel@vger.kernel.org>;
 Wed, 15 Apr 2026 10:01:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=redhat.com; s=google; t=1776272472; x=1776877272;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=jp1zrO1uuVX04cOgiBvHowBM6uu8e3oJ6YY/cYsJ/g0=;
        b=Mq5tueLEdfHrlPkFgPdOfYmUFNxCg0HAfslBQ7KLWFA0lsKVej7X3wCv3suHT6/1+w
         +uAWH214QOjr1sFdLrqOUClsc1ncaBRy4DLJgN64US1wj9v8IlapTXdnw9oVhr+dwhz5
         NXCbJ/feEZwhvRZpqXKu9PTGVoRj25tojae9FswuwLTqbqZh3dFGoa4qcisHc0p+O7t7
         6NbOG9NvAr8HLIjFiq0mfxO8do8osiuSR8XNttU2JaPI96oNl6whz8pj3YoQKleg73Ch
         iGwTwNJv5c+iPpNIaxfFzEqoTjBJKLf5NNZXe7CMiz/7CEPmQPDcnkrYZu4I68ZpgrZy
         P45Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1776272472; x=1776877272;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from
         :to:cc:subject:date:message-id:reply-to;
        bh=jp1zrO1uuVX04cOgiBvHowBM6uu8e3oJ6YY/cYsJ/g0=;
        b=dVYXUGMnkfUXWtmmNqCPdoefmAQ85R41IFf/6eBq7AWPdthZGNkUKSnM1l7+NhvcpJ
         rViUx7IR6RKvshhDQcOER/PWZJtnTKhnQdVP6mgia3lk9qjPoc+N7R0x6Nuxhm/4nvWr
         5frTSFpjcf7l3Hd2msl8n0pYYvOMoJkMKjLkLbapjSe7v/Yy45IJN6fEx8ammBYlb2JO
         D/ZEcz4lbM4QaWQg4hlcI8eNfpNHhnU4oGem6wnCfYaOY8z4LmoN20VuW2EnBJG89FSv
         zdkTqR4pWWtKJOe3U4rDhEL+ipC3W+VaVk+ZzH2i977lMvlf/FmdJtGVI3+78xfBkl7v
         nNUQ==
X-Gm-Message-State: AOJu0YzVh8S332V9sZ2J2bjVIP5hFKlMbHCNcd0rG/1lRQQLk1k1O3jG
	6VSmiW8d+irBVyuoVIdDVLSt908AvjOjhwPEXbZomXbaU7DJDHldOcZqOhclTYb8cJBXJX5fdua
	mIcNxSHn0d6R3wknka+Gr2L2p1n9LZfDYqKFEPoNq7lv5lFVZvMdRFCHE9DaNrlScPQ==
X-Gm-Gg: AeBDieuJ2xZqyzSPtddF+wksE4DchoxfteDDSrQotOB7VvjHRqeAF7c0q0liQabmyiy
	tW5CUrDkYyR5iP/tJWU3FDw4NFKUrDd8WFqZkaySvKx2pOn872JVEMyd4/My5m+5mz7N1nl6Hpz
	SabPQ9Efe8m00qtPJCJQhdOLwC3A7FMf8zAg5Y6TBeX0AdXOsgrS5LRLHIgymwbC2vNQxS5btzs
	uqmBO+2Y9qmMMR0MwYyA2oLdycDm7twNFmeTRjEsB0zC5J7ZHQD+JUZ+MjoaDvCWlQnAG/ZQ2JZ
	7jtEWlMfyF+lO+ERYDLyVdYMtpsbGo45o/huFIX7+WdBOocHgtosy6QXI1Nd43O6yJaGZN0HVJ/
	sokdv3blQXaBa5GjKpT0DlIz5zEepV8jQeJ6Wr+8oBBW5HOv1xxoA9E4PeVUspOX6zw==
X-Received: by 2002:a05:622a:1b1e:b0:50d:38d5:c6b1 with SMTP id
 d75a77b69052e-50dd5b526bamr327474031cf.16.1776272472265;
        Wed, 15 Apr 2026 10:01:12 -0700 (PDT)
X-Received: by 2002:a05:622a:1b1e:b0:50d:38d5:c6b1 with SMTP id
 d75a77b69052e-50dd5b526bamr327470971cf.16.1776272470063;
        Wed, 15 Apr 2026 10:01:10 -0700 (PDT)
Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com.
 [13.121.85.79])
        by smtp.gmail.com with ESMTPSA id
 d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.08
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 15 Apr 2026 10:01:09 -0700 (PDT)
From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	idryomov@gmail.com,
	vdubeyko@redhat.com,
	Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v2 4/7] ceph: add diagnostic timeout loop to wait_caps_flush()
Date: Wed, 15 Apr 2026 17:00:40 +0000
Message-Id: <20260415170043.3882912-5-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
wait loop that periodically dumps pending cap flush state.

The underlying wait semantics remain intact: callers still wait until the
requested cap flushes complete. The difference is that long stalls now
produce actionable diagnostics instead of looking like a silent hang.

CEPH_CAP_FLUSH_MAX_DUMP_COUNT bounds the diagnostics in two ways:
it limits the number of entries emitted per diagnostic dump, and it
limits the number of timed diagnostic dumps before the wait continues
silently.
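
The second bound can be seen in a small userspace simulation of the wait
loop. This is a sketch under assumptions: sketch_wait_loop() and
SK_MAX_DUMP_COUNT are hypothetical names, and the timed wait is replaced
by a countdown.

```c
#define SK_MAX_DUMP_COUNT 5

/*
 * timeouts_before_done: how many timed waits expire before the
 * awaited condition becomes true.
 * Returns the number of diagnostic dumps that would be emitted.
 */
static int sketch_wait_loop(int timeouts_before_done)
{
	int dumps = 0;
	int i = 0;
	long ret;

	do {
		/* Simulated timed wait: 0 means the wait timed out. */
		ret = (timeouts_before_done-- > 0) ? 0 : 1;
		if (ret == 0 && i++ < SK_MAX_DUMP_COUNT)
			dumps++;
	} while (ret == 0);

	return dumps;
}
```

The loop keeps waiting until the condition holds, but the number of dumps
is capped: eight expired waits still produce only five dumps.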

i_last_cap_flush_ack is written with WRITE_ONCE() in
handle_cap_flush_ack() and read with READ_ONCE() in the diagnostic dump,
which runs outside the inode lock domain.

Add a ci pointer to struct ceph_cap_flush so that the diagnostic
dump can identify which inode each pending flush belongs to.  The
new i_last_cap_flush_ack field tracks the latest acknowledged flush
tid per inode for diagnostic correlation.

This improves reset-drain observability and is also useful for
existing sync and writeback troubleshooting paths.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
 fs/ceph/caps.c       |  5 ++++
 fs/ceph/inode.c      |  1 +
 fs/ceph/mds_client.c | 56 ++++++++++++++++++++++++++++++++++++++++----
 fs/ceph/super.h      |  6 +++++
 4 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index cb9e78b713d9..c40175dd77ae 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info=
 *ci,
=20
 		spin_lock(&mdsc->cap_dirty_lock);
 		capsnap->cap_flush.tid =3D ++mdsc->last_cap_flush_tid;
+		capsnap->cap_flush.ci =3D ci;
 		list_add_tail(&capsnap->cap_flush.g_list,
 			      &mdsc->cap_flush_list);
 		if (oldest_flush_tid =3D=3D 0)
@@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void)
 		return NULL;
=20
 	cf->is_capsnap =3D false;
+	cf->ci =3D NULL;
 	return cf;
 }
=20
@@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode,
 	doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode));
=20
 	swap(cf, ci->i_prealloc_cap_flush);
+	cf->ci =3D ci;
 	cf->caps =3D flushing;
 	cf->wake =3D wake;
=20
@@ -3826,6 +3829,8 @@ static void handle_cap_flush_ack(struct inode *inode,=
 u64 flush_tid,
 	bool wake_ci =3D false;
 	bool wake_mdsc =3D false;
=20
+	WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid);
+
 	list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
 		/* Is this the one that was flushed? */
 		if (cf->tid =3D=3D flush_tid)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f75d66760d54..de465c7e96e8 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -670,6 +670,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	INIT_LIST_HEAD(&ci->i_cap_snaps);
 	ci->i_head_snapc =3D NULL;
 	ci->i_snap_caps =3D 0;
+	ci->i_last_cap_flush_ack =3D 0;
=20
 	ci->i_last_rd =3D ci->i_last_wr =3D jiffies - 3600 * HZ;
 	for (i =3D 0; i < CEPH_FILE_MODE_BITS; i++)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b14ede808436..7d17332d72d7 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -27,6 +27,8 @@
 #include <trace/events/ceph.h>
=20
 #define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE)
+#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
+#define CEPH_CAP_FLUSH_MAX_DUMP_COUNT 5
=20
 /*
  * A cluster of MDS (metadata server) daemons is responsible for
@@ -2286,19 +2288,65 @@ static int check_caps_flush(struct ceph_mds_client =
*mdsc,
 }
=20
 /*
- * flush all dirty inode data to disk.
+ * Dump pending cap flushes for diagnostic purposes.
  *
- * returns true if we've flushed through want_flush_tid
+ * cf->ci is safe to dereference here because the cap_dirty_lock is
+ * held, and cap_flush entries are removed from the global
+ * cap_flush_list under the same lock in the purge path.
+ */
+static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
+{
+	struct ceph_client *cl =3D mdsc->fsc->client;
+	struct ceph_cap_flush *cf;
+	int dumped =3D 0;
+
+	pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
+		       want_tid);
+	spin_lock(&mdsc->cap_dirty_lock);
+	list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
+		if (cf->tid > want_tid)
+			break;
+		if (++dumped > CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
+			break;
+		if (!cf->ci) {
+			pr_info_client(cl,
+				       "(null ci) %s tid=3D%llu wake=3D%d%s\n",
+				       ceph_cap_string(cf->caps), cf->tid,
+				       cf->wake,
+				       cf->is_capsnap ? " is_capsnap" : "");
+			continue;
+		}
+		pr_info_client(cl,
+			       "%llx:%llx %s tid=3D%llu last_ack=3D%llu wake=3D%d%s\n",
+			       ceph_vinop(&cf->ci->netfs.inode),
+			       ceph_cap_string(cf->caps), cf->tid,
+			       READ_ONCE(cf->ci->i_last_cap_flush_ack),
+			       cf->wake,
+			       cf->is_capsnap ? " is_capsnap" : "");
+	}
+	spin_unlock(&mdsc->cap_dirty_lock);
+}
+
+/*
+ * Wait for all cap flushes through @want_flush_tid to complete.
+ * Periodically dumps pending cap flush state for diagnostics.
  */
 static void wait_caps_flush(struct ceph_mds_client *mdsc,
 			    u64 want_flush_tid)
 {
 	struct ceph_client *cl =3D mdsc->fsc->client;
+	int i =3D 0;
+	long ret;
=20
 	doutc(cl, "want %llu\n", want_flush_tid);
=20
-	wait_event(mdsc->cap_flushing_wq,
-		   check_caps_flush(mdsc, want_flush_tid));
+	do {
+		ret =3D wait_event_timeout(mdsc->cap_flushing_wq,
+			   check_caps_flush(mdsc, want_flush_tid),
+			   CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ);
+		if (ret =3D=3D 0 && i++ < CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
+			dump_cap_flushes(mdsc, want_flush_tid);
+	} while (ret =3D=3D 0);
=20
 	doutc(cl, "ok, flushed thru %llu\n", want_flush_tid);
 }
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index c89ad8dcc969..1f901b1647e6 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -238,6 +238,7 @@ struct ceph_cap_flush {
 	bool is_capsnap; /* true means capsnap */
 	struct list_head g_list; // global
 	struct list_head i_list; // per inode
+	struct ceph_inode_info *ci;
 };
=20
 /*
@@ -443,6 +444,11 @@ struct ceph_inode_info {
 	struct ceph_snap_context *i_head_snapc;  /* set if wr_buffer_head > 0 or
 						    dirty|flushing caps */
 	unsigned i_snap_caps;           /* cap bits for snapped files */
+	/*
+	 * Written under i_ceph_lock, read via READ_ONCE()
+	 * from diagnostic paths.
+	 */
+	u64 i_last_cap_flush_ack;
=20
 	unsigned long i_last_rd;
 	unsigned long i_last_wr;
--=20
2.34.1
From nobody Sat Jun 20 11:50:30 2026
From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	idryomov@gmail.com,
	vdubeyko@redhat.com,
	Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v2 5/7] ceph: add client reset state machine and session
 teardown
Date: Wed, 15 Apr 2026 17:00:41 +0000
Message-Id: <20260415170043.3882912-6-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the client-side reset state machine, request gating, and manual
session teardown implementation.

Manual reset is an operator-triggered escape hatch for client/MDS
stalemates in which caps, locks, or unsafe metadata state stop making
forward progress.  The reset blocks new metadata work, attempts a
bounded best-effort drain of dirty client state while sessions are
still alive, and finally asks the MDS to close sessions before tearing
local session state down directly.

The reset state machine tracks four phases: IDLE -> QUIESCING ->
DRAINING -> TEARDOWN -> IDLE.  QUIESCING is set synchronously by
schedule_reset() before the workqueue item is dispatched, so that new
metadata requests and file-lock acquisitions are gated immediately --
even before the work function begins running.  All non-IDLE phases
block callers on blocked_wq, preventing races with session teardown.

The drain phase flushes mdlog state, dirty caps, and pending cap
releases for a bounded interval.  State that still cannot make progress
within that interval is discarded during teardown, which is the point
of the reset: break the stalemate and allow fresh sessions to rebuild
clean state.

The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then clean
up caps and requests under s->s_mutex.  Reconnect is not attempted
because the MDS only accepts reconnects during its own RECONNECT phase
after restart, not from an active client.

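Blocked callers are released when the reset completes and observe its
final result.  A caller stops waiting with -ETIMEDOUT after
CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC (120 seconds), or with -ERESTARTSYS
if interrupted by a signal.  The destroy path marks the reset as failed
(-ESHUTDOWN) and wakes blocked waiters before cancel_work_sync(), so
unmount does not stall.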

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
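The phase gating and the destroy-path handoff described in the commit
message can be modelled in userspace.  The following is a minimal
single-threaded sketch, not kernel code: struct reset_model and the
model_*() helpers are hypothetical names introduced for illustration,
and the spinlock, wait queue, and workqueue used by the patch are
deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; names only mirror the patch. */
enum reset_phase {
	RESET_IDLE,
	RESET_QUIESCING,
	RESET_DRAINING,
	RESET_TEARDOWN,
};

struct reset_model {
	enum reset_phase phase;
	bool shutdown;
	int last_errno;
	unsigned int trigger_count, success_count, failure_count;
};

/* Only an idle client may start a reset; gating begins immediately. */
static int model_schedule(struct reset_model *st)
{
	if (st->phase != RESET_IDLE)
		return -EBUSY;
	st->phase = RESET_QUIESCING;
	st->last_errno = 0;
	st->trigger_count++;
	return 0;
}

/* Any non-idle phase blocks new metadata requests and lock requests. */
static bool model_gated(const struct reset_model *st)
{
	return st->phase != RESET_IDLE;
}

/* Worker-side step: QUIESCING -> DRAINING -> TEARDOWN. */
static void model_advance(struct reset_model *st)
{
	assert(st->phase == RESET_QUIESCING || st->phase == RESET_DRAINING);
	st->phase = st->phase + 1;
}

/* Worker completion; a no-op once destroy has published -ESHUTDOWN. */
static void model_complete(struct reset_model *st, int ret)
{
	if (st->shutdown)
		return;
	st->last_errno = ret;
	st->phase = RESET_IDLE;
	if (ret)
		st->failure_count++;
	else
		st->success_count++;
}

/* Destroy path: fail an in-flight reset so waiters can be released. */
static void model_destroy(struct reset_model *st)
{
	st->shutdown = true;
	if (st->phase != RESET_IDLE) {
		st->phase = RESET_IDLE;
		st->last_errno = -ESHUTDOWN;
		st->failure_count++;
	}
}
```

In this model, scheduling from IDLE returns 0 and gates callers at once;
a second schedule returns -EBUSY until model_complete() or
model_destroy() puts the phase back to IDLE.  After model_destroy(), a
late model_complete() leaves the -ESHUTDOWN result in place.
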
 fs/ceph/locks.c      |  16 ++
 fs/ceph/mds_client.c | 421 +++++++++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.h |  42 +++++
 3 files changed, 479 insertions(+)

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index c4ff2266bb94..677221bd64e0 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_l=
ock *fl)
 {
 	struct inode *inode =3D file_inode(file);
 	struct ceph_inode_info *ci =3D ceph_inode(inode);
+	struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb);
 	struct ceph_client *cl =3D ceph_inode_to_client(inode);
 	int err =3D 0;
 	u16 op =3D CEPH_MDS_OP_SETFILELOCK;
@@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_=
lock *fl)
 		return -EIO;
 	}
=20
+	/* Wait for reset to complete before acquiring new locks */
+	if (op =3D=3D CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) {
+		err =3D ceph_mdsc_wait_for_reset(mdsc);
+		if (err)
+			return err;
+	}
+
 	if (lock_is_read(fl))
 		lock_cmd =3D CEPH_LOCK_SHARED;
 	else if (lock_is_write(fl))
@@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_=
lock *fl)
 {
 	struct inode *inode =3D file_inode(file);
 	struct ceph_inode_info *ci =3D ceph_inode(inode);
+	struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb);
 	struct ceph_client *cl =3D ceph_inode_to_client(inode);
 	int err =3D 0;
 	u8 wait =3D 0;
@@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file=
_lock *fl)
 		return -EIO;
 	}
=20
+	/* Wait for reset to complete before acquiring new locks */
+	if (!lock_is_unlock(fl)) {
+		err =3D ceph_mdsc_wait_for_reset(mdsc);
+		if (err)
+			return err;
+	}
+
 	if (IS_SETLKW(cmd))
 		wait =3D 1;
=20
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 7d17332d72d7..7e399b0dcc55 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
+#include <linux/delay.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 #include <linux/ratelimit.h>
@@ -67,6 +68,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc,
 			    struct list_head *head);
 static void ceph_cap_release_work(struct work_struct *work);
 static void ceph_cap_reclaim_work(struct work_struct *work);
+static void ceph_mdsc_reset_workfn(struct work_struct *work);
=20
 static const struct ceph_connection_operations mds_con_ops;
=20
@@ -3756,6 +3758,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client =
*mdsc, struct inode *dir,
 	struct ceph_client *cl =3D mdsc->fsc->client;
 	int err =3D 0;
=20
+	/*
+	 * If a reset is in progress, wait for it to complete.
+	 *
+	 * This is best-effort: a request can pass this check just
+	 * before the phase leaves IDLE and proceed concurrently with
+	 * reset.  That is acceptable because (a) such requests will
+	 * either complete normally or fail and be retried by the
+	 * caller, and (b) adding lock serialization here would
+	 * penalize every request for a rare manual operation.
+	 */
+	err =3D ceph_mdsc_wait_for_reset(mdsc);
+	if (err) {
+		doutc(cl, "wait_for_reset failed: %d\n", err);
+		return err;
+	}
+
 	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
 	if (req->r_inode)
 		ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
@@ -5163,6 +5181,387 @@ static int send_mds_reconnect(struct ceph_mds_clien=
t *mdsc,
 	return err;
 }
=20
+const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase)
+{
+	switch (phase) {
+	case CEPH_CLIENT_RESET_IDLE:	  return "idle";
+	case CEPH_CLIENT_RESET_QUIESCING: return "quiescing";
+	case CEPH_CLIENT_RESET_DRAINING:  return "draining";
+	case CEPH_CLIENT_RESET_TEARDOWN:  return "teardown";
+	default:			  return "unknown";
+	}
+}
+
+/*
+ * Wait for an active reset to complete.
+ * Returns 0 if no reset was active or the reset succeeded, otherwise
+ * the reset's last_errno.  Returns -ETIMEDOUT if the wait timed out,
+ * or -ERESTARTSYS if it was interrupted by a signal.
+ */
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
+{
+	struct ceph_client_reset_state *st =3D &mdsc->reset_state;
+	struct ceph_client *cl =3D mdsc->fsc->client;
+	unsigned long deadline =3D jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC *=
 HZ;
+	int blocked_count;
+	long wait_ret;
+	int ret;
+
+	if (READ_ONCE(st->phase) =3D=3D CEPH_CLIENT_RESET_IDLE)
+		return 0;
+
+	blocked_count =3D atomic_inc_return(&st->blocked_requests);
+	doutc(cl, "request blocked during reset, %d total blocked\n",
+	      blocked_count);
+
+retry:
+	wait_ret =3D wait_event_interruptible_timeout(st->blocked_wq,
+						    READ_ONCE(st->phase) =3D=3D
+						     CEPH_CLIENT_RESET_IDLE,
+						    deadline - jiffies);
+
+	if (wait_ret =3D=3D 0) {
+		atomic_dec(&st->blocked_requests);
+		pr_warn_client(cl, "timed out waiting for reset to complete\n");
+		return -ETIMEDOUT;
+	}
+	if (wait_ret < 0) {
+		atomic_dec(&st->blocked_requests);
+		return (int)wait_ret;  /* -ERESTARTSYS */
+	}
+
+	/*
+	 * Verify phase is still IDLE under the lock.  If another reset
+	 * was scheduled between the wake-up and this check, loop back
+	 * and wait for it to finish rather than returning a stale result.
+	 */
+	spin_lock(&st->lock);
+	if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) {
+		spin_unlock(&st->lock);
+		if (time_before(jiffies, deadline))
+			goto retry;
+		atomic_dec(&st->blocked_requests);
+		return -ETIMEDOUT;
+	}
+	ret =3D st->last_errno;
+	spin_unlock(&st->lock);
+
+	atomic_dec(&st->blocked_requests);
+	return ret;
+}
+
+static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
+{
+	struct ceph_client_reset_state *st =3D &mdsc->reset_state;
+
+	spin_lock(&st->lock);
+	/*
+	 * If destroy already marked us as shut down, it owns the
+	 * final bookkeeping.  Just bail so we don't overwrite the
+	 * -ESHUTDOWN result that waiters already observed.
+	 */
+	if (st->shutdown) {
+		spin_unlock(&st->lock);
+		return;
+	}
+	st->last_finish =3D jiffies;
+	st->last_errno =3D ret;
+	st->phase =3D CEPH_CLIENT_RESET_IDLE;
+	if (ret)
+		st->failure_count++;
+	else
+		st->success_count++;
+	spin_unlock(&st->lock);
+
+	/* Wake up all requests that were blocked waiting for reset */
+	wake_up_all(&st->blocked_wq);
+}
+
+static void ceph_mdsc_reset_workfn(struct work_struct *work)
+{
+	struct ceph_mds_client *mdsc =3D
+		container_of(work, struct ceph_mds_client, reset_work);
+	struct ceph_client_reset_state *st =3D &mdsc->reset_state;
+	struct ceph_client *cl =3D mdsc->fsc->client;
+	struct ceph_mds_session **sessions =3D NULL;
+	char reason[CEPH_CLIENT_RESET_REASON_LEN];
+	int max_sessions, i, n =3D 0, torn_down =3D 0;
+	int ret =3D 0;
+
+	spin_lock(&st->lock);
+	strscpy(reason, st->last_reason, sizeof(reason));
+	spin_unlock(&st->lock);
+
+	mutex_lock(&mdsc->mutex);
+	max_sessions =3D mdsc->max_sessions;
+	if (max_sessions <=3D 0) {
+		mutex_unlock(&mdsc->mutex);
+		goto out_complete;
+	}
+
+	sessions =3D kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL);
+	if (!sessions) {
+		mutex_unlock(&mdsc->mutex);
+		ret =3D -ENOMEM;
+		pr_err_client(cl,
+			      "manual session reset failed to allocate session array\n");
+		ceph_mdsc_reset_complete(mdsc, ret);
+		return;
+	}
+
+	for (i =3D 0; i < max_sessions; i++) {
+		struct ceph_mds_session *session =3D mdsc->sessions[i];
+
+		if (!session)
+			continue;
+
+		/*
+		 * Read session state without s_mutex to avoid nesting
+		 * mdsc->mutex -> s_mutex, which would invert the
+		 * s_mutex -> mdsc->mutex order used by
+		 * cleanup_session_requests().  s_state is an int
+		 * so loads are atomic; the teardown loop below
+		 * handles races with concurrent state transitions.
+		 */
+		switch (READ_ONCE(session->s_state)) {
+		case CEPH_MDS_SESSION_OPEN:
+		case CEPH_MDS_SESSION_HUNG:
+		case CEPH_MDS_SESSION_OPENING:
+		case CEPH_MDS_SESSION_RESTARTING:
+		case CEPH_MDS_SESSION_RECONNECTING:
+		case CEPH_MDS_SESSION_CLOSING:
+			sessions[n++] =3D ceph_get_mds_session(session);
+			break;
+		default:
+			pr_info_client(cl,
+				       "mds%d in state %s, skipping reset\n",
+				       session->s_mds,
+				       ceph_session_state_name(session->s_state));
+			break;
+		}
+	}
+	mutex_unlock(&mdsc->mutex);
+
+	pr_info_client(cl,
+		       "manual session reset executing (sessions=3D%d, reason=3D\"%s\")\=
n",
+		       n, reason);
+
+	if (n =3D=3D 0) {
+		kfree(sessions);
+		goto out_complete;
+	}
+
+	spin_lock(&st->lock);
+	st->phase =3D CEPH_CLIENT_RESET_DRAINING;
+	spin_unlock(&st->lock);
+
+	/*
+	 * Best-effort drain: flush dirty state while sessions are still
+	 * alive.  New requests are blocked while phase !=3D IDLE.
+	 * The sessions are functional, so non-stuck state drains normally.
+	 * Stuck state (the cause of the stalemate the operator is trying
+	 * to break) will not drain - that is expected, and we proceed to
+	 * forced teardown after the timeout.
+	 *
+	 * Three things are drained:
+	 *  1. MDS journal - send_flush_mdlog asks each MDS to journal
+	 *     pending unsafe operations (creates, renames, setattrs).
+	 *     Once journaled, they survive the session teardown.
+	 *  2. Dirty caps - ceph_flush_dirty_caps triggers cap flush on
+	 *     all sessions.  Non-stuck caps flush in milliseconds.
+	 *  3. Cap releases - push pending cap release messages.
+	 *
+	 * All three happen concurrently during the bounded wait window.
+	 */
+	for (i =3D 0; i < n; i++)
+		send_flush_mdlog(sessions[i]);
+
+	ceph_flush_dirty_caps(mdsc);
+	ceph_flush_cap_releases(mdsc);
+
+	spin_lock(&mdsc->cap_dirty_lock);
+	if (!list_empty(&mdsc->cap_flush_list)) {
+		struct ceph_cap_flush *cf =3D
+			list_last_entry(&mdsc->cap_flush_list,
+					struct ceph_cap_flush, g_list);
+		u64 want_flush =3D mdsc->last_cap_flush_tid;
+		long drain_ret;
+
+		/*
+		 * Setting wake on the last entry is sufficient: flush
+		 * entries complete in order, so when this entry finishes
+		 * all earlier ones are already done.
+		 */
+		cf->wake =3D true;
+		spin_unlock(&mdsc->cap_dirty_lock);
+		pr_info_client(cl,
+			       "draining (want_flush=3D%llu, %d sessions)\n",
+			       want_flush, n);
+		drain_ret =3D wait_event_timeout(mdsc->cap_flushing_wq,
+					       check_caps_flush(mdsc,
+								want_flush),
+					       CEPH_CLIENT_RESET_DRAIN_SEC * HZ);
+		if (drain_ret =3D=3D 0) {
+			pr_info_client(cl,
+				       "drain timed out, proceeding with forced teardown\n");
+			spin_lock(&st->lock);
+			st->drain_timed_out =3D true;
+			spin_unlock(&st->lock);
+		} else {
+			pr_info_client(cl, "drain completed successfully\n");
+			spin_lock(&st->lock);
+			st->drain_timed_out =3D false;
+			spin_unlock(&st->lock);
+		}
+	} else {
+		spin_unlock(&mdsc->cap_dirty_lock);
+		spin_lock(&st->lock);
+		st->drain_timed_out =3D false;
+		spin_unlock(&st->lock);
+	}
+
+	spin_lock(&st->lock);
+	st->phase =3D CEPH_CLIENT_RESET_TEARDOWN;
+	spin_unlock(&st->lock);
+
+	/*
+	 * Ask each MDS to close the session before we tear it down
+	 * locally.  Without this the MDS sees only a connection drop and
+	 * waits for the client to reconnect (up to session_autoclose
+	 * seconds) before evicting the session and releasing locks.
+	 *
+	 * Reuse the normal close machinery so the session state/sequence
+	 * snapshot is serialized under s_mutex and a racing s_seq bump
+	 * retransmits REQUEST_CLOSE while the session remains CLOSING.
+	 * We send all close requests first, then yield briefly to let the
+	 * network stack transmit them before __unregister_session()
+	 * closes the connections.
+	 */
+	for (i =3D 0; i < n; i++) {
+		int err;
+
+		mutex_lock(&sessions[i]->s_mutex);
+		err =3D __close_session(mdsc, sessions[i]);
+		mutex_unlock(&sessions[i]->s_mutex);
+		if (err < 0)
+			pr_warn_client(cl,
+				       "mds%d failed to queue close request before reset: %d\n",
+				       sessions[i]->s_mds, err);
+	}
+	if (n > 0)
+		msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS);
+
+	/*
+	 * Tear down each session: close the connection, remove all
+	 * caps, clean up requests, then kick pending requests so they
+	 * re-open a fresh session on the next attempt.
+	 *
+	 * This is modeled on the check_new_map() forced-close path
+	 * for stopped MDS ranks - a proven pattern for hard session
+	 * teardown.  We do NOT attempt send_mds_reconnect() because
+	 * the MDS only accepts reconnects during its own RECONNECT
+	 * phase (after MDS restart), not from an active client.
+	 *
+	 * Any state that did not drain (caps that didn't flush, unsafe
+	 * requests that the MDS didn't journal) is force-dropped here.
+	 * This is intentional: that state is stuck and is the reason
+	 * the operator triggered the reset.
+	 */
+	for (i =3D 0; i < n; i++) {
+		int mds =3D sessions[i]->s_mds;
+
+		pr_info_client(cl, "mds%d resetting session\n", mds);
+
+		mutex_lock(&mdsc->mutex);
+		if (mds >=3D mdsc->max_sessions ||
+		    mdsc->sessions[mds] !=3D sessions[i]) {
+			pr_info_client(cl,
+				       "mds%d session already torn down, skipping\n",
+				       mds);
+			mutex_unlock(&mdsc->mutex);
+			ceph_put_mds_session(sessions[i]);
+			continue;
+		}
+		sessions[i]->s_state =3D CEPH_MDS_SESSION_CLOSED;
+		__unregister_session(mdsc, sessions[i]);
+		__wake_requests(mdsc, &sessions[i]->s_waiting);
+		mutex_unlock(&mdsc->mutex);
+
+		mutex_lock(&sessions[i]->s_mutex);
+		cleanup_session_requests(mdsc, sessions[i]);
+		remove_session_caps(sessions[i]);
+		mutex_unlock(&sessions[i]->s_mutex);
+
+		wake_up_all(&mdsc->session_close_wq);
+
+		ceph_put_mds_session(sessions[i]);
+
+		mutex_lock(&mdsc->mutex);
+		kick_requests(mdsc, mds);
+		mutex_unlock(&mdsc->mutex);
+
+		torn_down++;
+		pr_info_client(cl, "mds%d session reset complete\n", mds);
+	}
+
+	kfree(sessions);
+
+	spin_lock(&st->lock);
+	st->sessions_reset =3D torn_down;
+	spin_unlock(&st->lock);
+
+out_complete:
+	ceph_mdsc_reset_complete(mdsc, ret);
+}
+
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+			     const char *reason)
+{
+	struct ceph_client_reset_state *st =3D &mdsc->reset_state;
+	struct ceph_fs_client *fsc =3D mdsc->fsc;
+	const char *msg =3D (reason && reason[0]) ? reason : "manual";
+	int mount_state;
+
+	mount_state =3D READ_ONCE(fsc->mount_state);
+	if (mount_state !=3D CEPH_MOUNT_MOUNTED) {
+		pr_warn_client(fsc->client,
+			       "reset rejected: mount_state=3D%d (not mounted)\n",
+			       mount_state);
+		return -EINVAL;
+	}
+
+	spin_lock(&st->lock);
+	if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) {
+		spin_unlock(&st->lock);
+		return -EBUSY;
+	}
+
+	st->phase =3D CEPH_CLIENT_RESET_QUIESCING;
+	st->last_start =3D jiffies;
+	st->last_errno =3D 0;
+	st->drain_timed_out =3D false;
+	st->sessions_reset =3D 0;
+	st->trigger_count++;
+	strscpy(st->last_reason, msg, sizeof(st->last_reason));
+	spin_unlock(&st->lock);
+
+	if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) {
+		spin_lock(&st->lock);
+		st->phase =3D CEPH_CLIENT_RESET_IDLE;
+		st->last_errno =3D -EALREADY;
+		st->last_finish =3D jiffies;
+		st->failure_count++;
+		spin_unlock(&st->lock);
+		wake_up_all(&st->blocked_wq);
+		return -EALREADY;
+	}
+
+	pr_info_client(mdsc->fsc->client,
+		       "manual session reset scheduled (reason=3D\"%s\")\n",
+		       msg);
+	return 0;
+}
+
=20
 /*
  * compare old and new mdsmaps, kicking requests
@@ -5702,6 +6101,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
 	INIT_LIST_HEAD(&mdsc->dentry_leases);
 	INIT_LIST_HEAD(&mdsc->dentry_dir_leases);
=20
+	spin_lock_init(&mdsc->reset_state.lock);
+	init_waitqueue_head(&mdsc->reset_state.blocked_wq);
+	atomic_set(&mdsc->reset_state.blocked_requests, 0);
+	INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn);
+
 	ceph_caps_init(mdsc);
 	ceph_adjust_caps_max_min(mdsc, fsc->mount_options);
=20
@@ -6227,6 +6631,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
 	/* flush out any connection work with references to us */
 	ceph_msgr_flush();
=20
+	/*
+	 * Mark reset as failed and wake any blocked waiters before
+	 * cancelling, so unmount doesn't stall on blocked_wq timeout
+	 * if cancel_work_sync() prevents the work from running.
+	 */
+	spin_lock(&mdsc->reset_state.lock);
+	mdsc->reset_state.shutdown =3D true;
+	if (mdsc->reset_state.phase !=3D CEPH_CLIENT_RESET_IDLE) {
+		mdsc->reset_state.phase =3D CEPH_CLIENT_RESET_IDLE;
+		mdsc->reset_state.last_errno =3D -ESHUTDOWN;
+		mdsc->reset_state.last_finish =3D jiffies;
+		mdsc->reset_state.failure_count++;
+	}
+	spin_unlock(&mdsc->reset_state.lock);
+	wake_up_all(&mdsc->reset_state.blocked_wq);
+
+	cancel_work_sync(&mdsc->reset_work);
 	ceph_mdsc_stop(mdsc);
=20
 	ceph_metric_destroy(&mdsc->metric);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index e91a199d56fd..afc08b0abbd5 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -74,6 +74,42 @@ struct ceph_fs_client;
 struct ceph_cap;
=20
 #define MDS_AUTH_UID_ANY -1
+#define CEPH_CLIENT_RESET_REASON_LEN	64
+#define CEPH_CLIENT_RESET_DRAIN_SEC	5
+#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100
+#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120
+
+enum ceph_client_reset_phase {
+	CEPH_CLIENT_RESET_IDLE =3D 0,
+	/*
+	 * QUIESCING is set synchronously by schedule_reset() before the
+	 * workqueue item is dispatched.  It gates new requests (any
+	 * phase !=3D IDLE blocks callers) during the window between
+	 * scheduling and the work function's transition to DRAINING.
+	 */
+	CEPH_CLIENT_RESET_QUIESCING,
+	CEPH_CLIENT_RESET_DRAINING,
+	CEPH_CLIENT_RESET_TEARDOWN,
+};
+
+struct ceph_client_reset_state {
+	spinlock_t lock;
+	u64 trigger_count;
+	u64 success_count;
+	u64 failure_count;
+	unsigned long last_start;
+	unsigned long last_finish;
+	int last_errno;
+	enum ceph_client_reset_phase phase;
+	bool drain_timed_out;
+	bool shutdown;
+	int sessions_reset;
+	char last_reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+	/* Request blocking during reset */
+	wait_queue_head_t blocked_wq;
+	atomic_t blocked_requests;
+};
=20
 struct ceph_mds_cap_match {
 	s64 uid;  /* default to MDS_AUTH_UID_ANY */
@@ -536,6 +572,8 @@ struct ceph_mds_client {
 	struct list_head  dentry_dir_leases; /* lru list */
=20
 	struct ceph_client_metric metric;
+	struct work_struct	reset_work;
+	struct ceph_client_reset_state reset_state;
=20
 	spinlock_t		snapid_map_lock;
 	struct rb_root		snapid_map_tree;
@@ -559,10 +597,14 @@ extern struct ceph_mds_session *
 __ceph_lookup_mds_session(struct ceph_mds_client *, int mds);
=20
 extern const char *ceph_session_state_name(int s);
+extern const char *ceph_reset_phase_name(enum ceph_client_reset_phase phas=
e);
=20
 extern struct ceph_mds_session *
 ceph_get_mds_session(struct ceph_mds_session *s);
 extern void ceph_put_mds_session(struct ceph_mds_session *s);
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+			     const char *reason);
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc);
=20
 extern int ceph_mdsc_init(struct ceph_fs_client *fsc);
 extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
--=20
2.34.1
From nobody Sat Jun 20 11:50:30 2026
From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	idryomov@gmail.com,
	vdubeyko@redhat.com,
	Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v2 6/7] ceph: add manual reset debugfs control and tracepoints
Date: Wed, 15 Apr 2026 17:00:42 +0000
Message-Id: <20260415170043.3882912-7-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the debugfs and trace plumbing used to trigger and observe
manual client reset.

The reset interface exposes a trigger file for operator-initiated
reset and a status file for tracking the most recent run.  The
tracepoints record scheduling, completion, and blocked caller
behavior so reset progress can be diagnosed from the client side.

debugfs layout under /sys/kernel/debug/ceph/<client>/reset/:
  trigger (0200) - write a reason string to initiate a manual reset
  status  (0400) - read the counters and the most recent reset result

Tracepoints:
  ceph_client_reset_schedule  - reset queued
  ceph_client_reset_complete  - reset finished (success or failure)
  ceph_client_reset_blocked   - caller blocked waiting for reset
  ceph_client_reset_unblocked - caller stops waiting (reset finished,
                                timed out, or interrupted)

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
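Not part of the commit log: a minimal sketch of how the two debugfs
files might be exercised from a shell.  The helper name
ceph_manual_reset and its arguments are hypothetical; the per-client
directory is passed in rather than guessed, and on a real system this
needs root and a mounted debugfs.

```shell
# Hypothetical helper, not part of this patch.
# $1: per-client debugfs directory (/sys/kernel/debug/ceph/<client>)
# $2: free-form reason string recorded with the reset
ceph_manual_reset() {
	# A rejected write (for example while another reset is in flight)
	# makes printf fail, so report that to the caller.
	printf '%s\n' "$2" > "$1/reset/trigger" || return 1
	# Dump the key/value status lines for the most recent run.
	cat "$1/reset/status"
}
```

The write only schedules the reset, so a status read issued right
after it may still report a non-idle phase.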
 fs/ceph/debugfs.c           | 104 ++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.c        |   8 +++
 fs/ceph/super.h             |   3 ++
 include/trace/events/ceph.h |  63 ++++++++++++++++++++++
 4 files changed, 178 insertions(+)

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 7dc307790240..d46d41ec7a86 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,6 +9,7 @@
 #include <linux/seq_file.h>
 #include <linux/math64.h>
 #include <linux/ktime.h>
+#include <linux/uaccess.h>
=20
 #include <linux/ceph/libceph.h>
 #include <linux/ceph/mon_client.h>
@@ -360,16 +361,107 @@ static int status_show(struct seq_file *s, void *p)
 	return 0;
 }
=20
+static int reset_status_show(struct seq_file *s, void *p)
+{
+	struct ceph_fs_client *fsc =3D s->private;
+	struct ceph_mds_client *mdsc =3D fsc->mdsc;
+	struct ceph_client_reset_state *st;
+	u64 trigger =3D 0, success =3D 0, failure =3D 0;
+	unsigned long last_start =3D 0, last_finish =3D 0;
+	int last_errno =3D 0;
+	enum ceph_client_reset_phase phase =3D CEPH_CLIENT_RESET_IDLE;
+	bool drain_timed_out =3D false;
+	int sessions_reset =3D 0;
+	int blocked_requests =3D 0;
+	char reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+	if (!mdsc)
+		return 0;
+
+	st =3D &mdsc->reset_state;
+
+	spin_lock(&st->lock);
+	trigger =3D st->trigger_count;
+	success =3D st->success_count;
+	failure =3D st->failure_count;
+	last_start =3D st->last_start;
+	last_finish =3D st->last_finish;
+	last_errno =3D st->last_errno;
+	phase =3D st->phase;
+	drain_timed_out =3D st->drain_timed_out;
+	sessions_reset =3D st->sessions_reset;
+	strscpy(reason, st->last_reason, sizeof(reason));
+	spin_unlock(&st->lock);
+
+	blocked_requests =3D atomic_read(&st->blocked_requests);
+
+	seq_printf(s, "phase: %s\n", ceph_reset_phase_name(phase));
+	seq_printf(s, "trigger_count: %llu\n", trigger);
+	seq_printf(s, "success_count: %llu\n", success);
+	seq_printf(s, "failure_count: %llu\n", failure);
+	if (last_start)
+		seq_printf(s, "last_start_ms_ago: %u\n",
+			   jiffies_to_msecs(jiffies - last_start));
+	else
+		seq_puts(s, "last_start_ms_ago: (never)\n");
+	if (last_finish)
+		seq_printf(s, "last_finish_ms_ago: %u\n",
+			   jiffies_to_msecs(jiffies - last_finish));
+	else
+		seq_puts(s, "last_finish_ms_ago: (never)\n");
+	seq_printf(s, "last_errno: %d\n", last_errno);
+	seq_printf(s, "last_reason: %s\n",
+		   reason[0] ? reason : "(none)");
+	seq_printf(s, "drain_timed_out: %s\n",
+		   drain_timed_out ? "yes" : "no");
+	seq_printf(s, "sessions_reset: %d\n", sessions_reset);
+	seq_printf(s, "blocked_requests: %d\n", blocked_requests);
+
+	return 0;
+}
+
+static ssize_t reset_trigger_write(struct file *file, const char __user *b=
uf,
+				   size_t len, loff_t *ppos)
+{
+	struct ceph_fs_client *fsc =3D file->private_data;
+	struct ceph_mds_client *mdsc =3D fsc->mdsc;
+	char reason[CEPH_CLIENT_RESET_REASON_LEN];
+	size_t copy;
+	int ret;
+
+	if (!mdsc)
+		return -ENODEV;
+
+	copy =3D min_t(size_t, len, sizeof(reason) - 1);
+	if (copy && copy_from_user(reason, buf, copy))
+		return -EFAULT;
+	reason[copy] =3D '\0';
+	strim(reason);
+
+	ret =3D ceph_mdsc_schedule_reset(mdsc, reason);
+	if (ret)
+		return ret;
+
+	return len;
+}
+
 DEFINE_SHOW_ATTRIBUTE(mdsmap);
 DEFINE_SHOW_ATTRIBUTE(mdsc);
 DEFINE_SHOW_ATTRIBUTE(caps);
 DEFINE_SHOW_ATTRIBUTE(mds_sessions);
 DEFINE_SHOW_ATTRIBUTE(status);
+DEFINE_SHOW_ATTRIBUTE(reset_status);
 DEFINE_SHOW_ATTRIBUTE(metrics_file);
 DEFINE_SHOW_ATTRIBUTE(metrics_latency);
 DEFINE_SHOW_ATTRIBUTE(metrics_size);
 DEFINE_SHOW_ATTRIBUTE(metrics_caps);
=20
+static const struct file_operations ceph_reset_trigger_fops =3D {
+	.owner =3D THIS_MODULE,
+	.open =3D simple_open,
+	.write =3D reset_trigger_write,
+	.llseek =3D noop_llseek,
+};
=20
 /*
  * debugfs
@@ -404,6 +496,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 	debugfs_remove(fsc->debugfs_caps);
 	debugfs_remove(fsc->debugfs_status);
 	debugfs_remove(fsc->debugfs_mdsc);
+	debugfs_remove_recursive(fsc->debugfs_reset_dir);
 	debugfs_remove_recursive(fsc->debugfs_metrics_dir);
 	doutc(fsc->client, "done\n");
 }
@@ -451,6 +544,17 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 						fsc,
 						&caps_fops);
=20
+	fsc->debugfs_reset_dir =3D debugfs_create_dir("reset",
+						    fsc->client->debugfs_dir);
+	fsc->debugfs_reset_trigger =3D
+		debugfs_create_file("trigger", 0200,
+				    fsc->debugfs_reset_dir, fsc,
+				    &ceph_reset_trigger_fops);
+	fsc->debugfs_reset_status =3D
+		debugfs_create_file("status", 0400,
+				    fsc->debugfs_reset_dir, fsc,
+				    &reset_status_fops);
+
 	fsc->debugfs_status =3D debugfs_create_file("status",
 						  0400,
 						  fsc->client->debugfs_dir,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 7e399b0dcc55..98a882cf8b65 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5213,6 +5213,7 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *=
mdsc)
 	blocked_count =3D atomic_inc_return(&st->blocked_requests);
 	doutc(cl, "request blocked during reset, %d total blocked\n",
 	      blocked_count);
+	trace_ceph_client_reset_blocked(mdsc, blocked_count);
=20
 retry:
 	wait_ret =3D wait_event_interruptible_timeout(st->blocked_wq,
@@ -5223,10 +5224,12 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client=
 *mdsc)
 	if (wait_ret =3D=3D 0) {
 		atomic_dec(&st->blocked_requests);
 		pr_warn_client(cl, "timed out waiting for reset to complete\n");
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	if (wait_ret < 0) {
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, (int)wait_ret);
 		return (int)wait_ret;  /* -ERESTARTSYS */
 	}
=20
@@ -5241,12 +5244,14 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client=
 *mdsc)
 		if (time_before(jiffies, deadline))
 			goto retry;
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	ret =3D st->last_errno;
 	spin_unlock(&st->lock);
=20
 	atomic_dec(&st->blocked_requests);
+	trace_ceph_client_reset_unblocked(mdsc, ret);
 	return ret;
 }
=20
@@ -5275,6 +5280,8 @@ static void ceph_mdsc_reset_complete(struct ceph_mds_=
client *mdsc, int ret)
=20
 	/* Wake up all requests that were blocked waiting for reset */
 	wake_up_all(&st->blocked_wq);
+
+	trace_ceph_client_reset_complete(mdsc, ret);
 }
=20
 static void ceph_mdsc_reset_workfn(struct work_struct *work)
@@ -5559,6 +5566,7 @@ int ceph_mdsc_schedule_reset(struct ceph_mds_client *=
mdsc,
 	pr_info_client(mdsc->fsc->client,
 		       "manual session reset scheduled (reason=3D\"%s\")\n",
 		       msg);
+	trace_ceph_client_reset_schedule(mdsc, msg);
 	return 0;
 }
=20
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 1f901b1647e6..98af0a823c81 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -179,6 +179,9 @@ struct ceph_fs_client {
 	struct dentry *debugfs_status;
 	struct dentry *debugfs_mds_sessions;
 	struct dentry *debugfs_metrics_dir;
+	struct dentry *debugfs_reset_dir;
+	struct dentry *debugfs_reset_trigger;
+	struct dentry *debugfs_reset_status;
 #endif
=20
 #ifdef CONFIG_CEPH_FSCACHE
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 08cb0659fbfc..e853c891ef71 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -226,6 +226,69 @@ TRACE_EVENT(ceph_handle_caps,
 		  __entry->mseq)
 );
=20
+/*
+ * Client reset tracepoints - identify the client by its monitor-
+ * assigned global_id so traces remain meaningful when kernel pointer
+ * hashing is enabled.
+ */
+TRACE_EVENT(ceph_client_reset_schedule,
+	TP_PROTO(const struct ceph_mds_client *mdsc, const char *reason),
+	TP_ARGS(mdsc, reason),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__string(reason, reason ? reason : "")
+	),
+	TP_fast_assign(
+		__entry->client_id =3D mdsc->fsc->client->monc.auth->global_id;
+		__assign_str(reason);
+	),
+	TP_printk("client_id=3D%llu reason=3D%s",
+		  __entry->client_id, __get_str(reason))
+);
+
+TRACE_EVENT(ceph_client_reset_complete,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id =3D mdsc->fsc->client->monc.auth->global_id;
+		__entry->ret =3D ret;
+	),
+	TP_printk("client_id=3D%llu ret=3D%d", __entry->client_id, __entry->ret)
+);
+
+TRACE_EVENT(ceph_client_reset_blocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int blocked_count),
+	TP_ARGS(mdsc, blocked_count),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, blocked_count)
+	),
+	TP_fast_assign(
+		__entry->client_id =3D mdsc->fsc->client->monc.auth->global_id;
+		__entry->blocked_count =3D blocked_count;
+	),
+	TP_printk("client_id=3D%llu blocked_count=3D%d", __entry->client_id,
+		  __entry->blocked_count)
+);
+
+TRACE_EVENT(ceph_client_reset_unblocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id =3D mdsc->fsc->client->monc.auth->global_id;
+		__entry->ret =3D ret;
+	),
+	TP_printk("client_id=3D%llu ret=3D%d", __entry->client_id, __entry->ret)
+);
+
 #undef EM
 #undef E_
 #endif /* _TRACE_CEPH_H */
--=20
2.34.1
From nobody Sat Jun 20 11:50:30 2026
From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	idryomov@gmail.com,
	vdubeyko@redhat.com,
	Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v2 7/7] ceph: add manual reset selftests and validation
 harness
Date: Wed, 15 Apr 2026 17:00:43 +0000
Message-Id: <20260415170043.3882912-8-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add single-client selftests and a validation wrapper for manual
client reset.

The test set covers reset stress under concurrent metadata
activity together with targeted corner cases for overlap,
dirty-state handling, stale lock behavior, and unmount while reset
is active.  A validation wrapper runs the individual stages with
watchdog timeouts and captures the final reset status for post-run
checks.

The stress validator checks failure_count in addition to
last_errno so that transient mid-run reset failures are caught
even when a later reset succeeds.

Keep the test scope intentionally focused on the shipped
single-client reset behavior so the series includes a practical
regression signal for the final design.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
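Not part of the commit log: a rough sketch of the post-run check
described above, assuming the status file keeps the "key: value"
layout from the debugfs patch.  check_reset_status is a hypothetical
name, not a function taken from the scripts in this patch.

```shell
# Hypothetical check, not taken from the scripts in this patch.
# Reads reset/status text on stdin and returns non-zero if any reset
# failed during the run or if the most recent reset ended in an error.
check_reset_status() {
	awk -F': ' '
		/^failure_count: / && ($2 + 0) { bad++ }
		/^last_errno: / && ($2 + 0) { bad++ }
		END { exit (bad > 0) }
	'
}
```

Looking at failure_count as well as last_errno is what catches a
mid-run failure that a later successful reset would otherwise hide.
With the client directory held in a variable, the check could be run
as: check_reset_status < "$client_dir/reset/status"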
 MAINTAINERS                                   |   1 +
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/filesystems/ceph/Makefile       |   7 +
 .../selftests/filesystems/ceph/README.md      |  84 +++
 .../filesystems/ceph/reset_corner_cases.sh    | 646 ++++++++++++++++
 .../filesystems/ceph/reset_stress.sh          | 694 ++++++++++++++++++
 .../filesystems/ceph/run_validation.sh        | 350 +++++++++
 .../selftests/filesystems/ceph/settings       |   1 +
 .../filesystems/ceph/validate_consistency.py  | 297 ++++++++
 9 files changed, 2081 insertions(+)
 create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
 create mode 100644 tools/testing/selftests/filesystems/ceph/README.md
 create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_c=
ases.sh
 create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
 create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation=
.sh
 create mode 100644 tools/testing/selftests/filesystems/ceph/settings
 create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consi=
stency.py

diff --git a/MAINTAINERS b/MAINTAINERS
index d1cc0e12fe1f..87c36a26c1f2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5917,6 +5917,7 @@ B:	https://tracker.ceph.com/
 T:	git https://github.com/ceph/ceph-client.git
 F:	Documentation/filesystems/ceph.rst
 F:	fs/ceph/
+F:	tools/testing/selftests/filesystems/ceph/
=20
 CERTIFICATE HANDLING
 M:	David Howells <dhowells@redhat.com>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Mak=
efile
index 450f13ba4cca..81c01a7062e0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -32,6 +32,7 @@ TARGETS +=3D exec
 TARGETS +=3D fchmodat2
 TARGETS +=3D filesystems
 TARGETS +=3D filesystems/binderfs
+TARGETS +=3D filesystems/ceph
 TARGETS +=3D filesystems/epoll
 TARGETS +=3D filesystems/fat
 TARGETS +=3D filesystems/overlayfs
diff --git a/tools/testing/selftests/filesystems/ceph/Makefile b/tools/test=
ing/selftests/filesystems/ceph/Makefile
new file mode 100644
index 000000000000..3ad768bc8420
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_PROGS :=3D run_validation.sh
+TEST_FILES :=3D reset_stress.sh reset_corner_cases.sh \
+	      validate_consistency.py README.md settings
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/filesystems/ceph/README.md b/tools/tes=
ting/selftests/filesystems/ceph/README.md
new file mode 100644
index 000000000000..47931edf52b0
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/README.md
@@ -0,0 +1,84 @@
+# CephFS Client Reset Test Suite
+
+Test suite for the CephFS kernel client manual session reset feature.
+This trimmed set contains the single-client stress test, the targeted
+corner-case test, and the one-shot validation harness used during
+feature bring-up.
+
+## Prerequisites
+
+- Linux kernel with the CephFS client reset feature (this branch)
+- A running Ceph cluster with at least one MDS
+- Root access (debugfs requires it)
+- Python 3 (for validators)
+- flock utility (for lock tests, usually in util-linux)
+
+## Test inventory
+
+| Test | Script(s) | What it covers |
+|------|-----------|----------------|
+| Single-client stress | `reset_stress.sh` | I/O + resets + data integrity=
 on one mount |
+| Corner cases | `reset_corner_cases.sh` | EBUSY, dirty caps, flock reclai=
m, unmount-during-reset |
+| Validation harness | `run_validation.sh` | baseline + corner cases + mod=
erate/aggressive stress + final status check |
+
+## Quick start
+
+Stress run:
+
+    sudo ./reset_stress.sh --mount-point /mnt/cephfs --profile moderate
+
+Corner cases:
+
+    sudo ./reset_corner_cases.sh --mount-point /mnt/cephfs
+
+End-to-end validation:
+
+    sudo ./run_validation.sh --mount-point /mnt/cephfs
+
+## Stress profiles
+
+    baseline   - no resets, 1 IO + 1 rename, 600s
+    moderate   - reset every 5-15s, 2 IO + 1 rename, 900s
+    aggressive - reset every 1-5s, 4 IO + 2 rename, 900s
+    soak       - reset every 5-15s, 2 IO + 1 rename, 3600s
+
+## Key options (all scripts)
+
+    --mount-point PATH   CephFS mount point (required)
+    --client-id ID       Debugfs client id (auto-detected if unique)
+
+reset_stress.sh additionally accepts:
+
+    --profile NAME       baseline|moderate|aggressive|soak
+    --duration-sec N     Override profile runtime
+    --no-reset           Disable reset injection
+    --out-dir PATH       Artifact directory
+
+## Corner case tests
+
+    [1/4] ebusy_rejection       Second reset rejected while first in-flight
+    [2/4] dirty_caps_at_reset   Reset with unflushed dirty caps
+    [3/4] flock_after_reset     Stale lock EIO + fresh lock after holder e=
xit
+    [4/4] unmount_during_reset  umount during active reset (ESHUTDOWN path)
+
+Test 4 requires creating a second CephFS mount instance and SKIPs if
+the host cannot do so.  See `--help` output for details.
+
+## Troubleshooting
+
+**No writable Ceph reset interface found:**
+Kernel lacks the reset feature, debugfs not mounted, or not root.
+Check: `ls /sys/kernel/debug/ceph/*/reset/`
+
+**Multiple Ceph clients found:**
+Use `--client-id` to select one.
+List: `ls /sys/kernel/debug/ceph/`
+
+## Files
+
+| File | Role |
+|------|------|
+| `reset_stress.sh` | Single-client stress test runner |
+| `validate_consistency.py` | Single-client post-run validator |
+| `reset_corner_cases.sh` | Corner case harness (4 sequential tests) |
+| `run_validation.sh` | One-shot validation harness |
diff --git a/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh=
 b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
new file mode 100755
index 000000000000..a6dae84a616d
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
@@ -0,0 +1,646 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset corner case tests.
+# Runs a checklist of targeted tests that exercise specific reset
+# code paths not covered by the stress tests.
+#
+# Requires: mounted CephFS, debugfs access (root), flock(1) utility.
+
+set -uo pipefail
+
+KSFT_SKIP=3D4
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+	MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "SKIP: No CephFS mount found and --mount-point not specified"
+		exit "$KSFT_SKIP"
+	fi
+	exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=3D""
+DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph"
+DEBUGFS_CLIENT=3D""
+TRIGGER_PATH=3D""
+STATUS_PATH=3D""
+TEMP_MNT=3D""
+
+PASS_COUNT=3D0
+FAIL_COUNT=3D0
+SKIP_COUNT=3D0
+TOTAL=3D4
+
+log()
+{
+	printf '[%s] %s\n' "$(date -u +%H:%M:%S)" "$1"
+}
+
+result()
+{
+	local num=3D"$1"
+	local name=3D"$2"
+	local status=3D"$3"
+	local detail=3D"${4:-}"
+
+	case "$status" in
+	PASS) PASS_COUNT=3D$((PASS_COUNT + 1)) ;;
+	FAIL) FAIL_COUNT=3D$((FAIL_COUNT + 1)) ;;
+	SKIP) SKIP_COUNT=3D$((SKIP_COUNT + 1)) ;;
+	esac
+
+	if [[ -n "$detail" ]]; then
+		printf '[%d/%d] %-30s %s  (%s)\n' "$num" "$TOTAL" "$name" "$status" "$de=
tail"
+	else
+		printf '[%d/%d] %-30s %s\n' "$num" "$TOTAL" "$name" "$status"
+	fi
+}
+
+read_status_field()
+{
+	local field=3D"$1"
+	awk -F': ' -v key=3D"$field" '$1 =3D=3D key {print $2}' "$STATUS_PATH" 2>=
/dev/null
+}
+
+wait_reset_done()
+{
+	local timeout=3D"${1:-30}"
+	local elapsed=3D0
+
+	while [[ "$(read_status_field "phase")" !=3D "idle" ]]; do
+		sleep 1
+		elapsed=3D$((elapsed + 1))
+		if [[ "$elapsed" -ge "$timeout" ]]; then
+			return 1
+		fi
+	done
+	return 0
+}
+
+list_reset_clients()
+{
+	local entry
+
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		entry=3D"$(basename "$entry")"
+		[[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+		[[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+		printf '%s\n' "$entry"
+	done
+}
+
+wait_status_nonidle()
+{
+	local status_path=3D"$1"
+	local timeout=3D"${2:-10}"
+	local polls=3D$((timeout * 10))
+	local phase
+
+	while [[ "$polls" -gt 0 ]]; do
+		phase=3D"$(awk -F': ' '$1 =3D=3D "phase" {print $2}' "$status_path" 2>/d=
ev/null)"
+		if [[ -n "$phase" && "$phase" !=3D "idle" ]]; then
+			return 0
+		fi
+		sleep 0.1
+		polls=3D$((polls - 1))
+	done
+
+	return 1
+}
+
+discover_debugfs()
+{
+	local candidates=3D()
+	local entry
+
+	if [[ -n "$DEBUGFS_CLIENT" ]]; then
+		if [[ ! -d "$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset" ]]; then
+			echo "SKIP: reset debugfs not found for $DEBUGFS_CLIENT" >&2
+			exit "$KSFT_SKIP"
+		fi
+		return 0
+	fi
+
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		entry=3D"$(basename "$entry")"
+		[[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+		[[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+		candidates+=3D("$entry")
+	done
+
+	if [[ ${#candidates[@]} -eq 0 ]]; then
+		echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" =
>&2
+		exit "$KSFT_SKIP"
+	fi
+
+	if [[ ${#candidates[@]} -gt 1 ]]; then
+		echo "SKIP: Multiple Ceph clients found: ${candidates[*]}. Use --client-=
id." >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	DEBUGFS_CLIENT=3D"${candidates[0]}"
+}
+
+# --- Test 1: ebusy_rejection --------------------------------------------=
----
+#
+# Trigger a reset while another is guaranteed in-flight.  Creates
+# dirty state so the first reset enters DRAINING (which takes
+# measurable time), then polls until phase !=3D idle and issues the
+# second trigger.  The second trigger must fail (the kernel returns
+# -EBUSY), and only one reset must be counted in the accounting.
+
+test_ebusy_rejection()
+{
+	local num=3D1
+	local name=3D"ebusy_rejection"
+	local testfile=3D"$MOUNT_POINT/.reset_corner_ebusy_$$"
+	local tc_before tc_after sc_before sc_after second_rc phase elapsed
+
+	tc_before=3D"$(read_status_field "trigger_count")"
+	sc_before=3D"$(read_status_field "success_count")"
+
+	# Create dirty state so the first reset enters DRAINING
+	echo "ebusy_dirty_data" > "$testfile"
+	sync "$testfile"
+
+	python3 -c "
+import os, sys
+fd =3D os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_ebusy_test\n')
+sys.stdout.write('written')
+" 2>/dev/null || {
+		result "$num" "$name" FAIL "dirty write failed"
+		rm -f "$testfile"
+		return
+	}
+
+	# Trigger the first reset -- it will drain dirty state
+	echo "ebusy_first" > "$TRIGGER_PATH" 2>/dev/null || {
+		result "$num" "$name" FAIL "first trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	# Poll until phase is non-idle (quiescing or draining)
+	elapsed=3D0
+	while true; do
+		phase=3D"$(read_status_field "phase")"
+		if [[ -n "$phase" && "$phase" !=3D "idle" ]]; then
+			break
+		fi
+		sleep 0.1
+		elapsed=3D$((elapsed + 1))
+		if [[ "$elapsed" -ge 50 ]]; then
+			result "$num" "$name" SKIP \
+				"first reset completed before overlap could be tested"
+			rm -f "$testfile" 2>/dev/null
+			return
+		fi
+	done
+
+	# Issue the second trigger -- should be rejected with EBUSY
+	second_rc=3D0
+	echo "ebusy_second" > "$TRIGGER_PATH" 2>/dev/null && second_rc=3D0 || sec=
ond_rc=3D$?
+
+	if ! wait_reset_done 30; then
+		result "$num" "$name" FAIL "first reset never completed"
+		rm -f "$testfile"
+		return
+	fi
+
+	tc_after=3D"$(read_status_field "trigger_count")"
+	sc_after=3D"$(read_status_field "success_count")"
+
+	if [[ "$((tc_after - tc_before))" -ne 1 ]]; then
+		result "$num" "$name" FAIL "trigger_count +$((tc_after - tc_before)), ex=
pected +1"
+		rm -f "$testfile"
+		return
+	fi
+
+	if [[ "$((sc_after - sc_before))" -ne 1 ]]; then
+		result "$num" "$name" FAIL "success_count +$((sc_after - sc_before)), ex=
pected +1"
+		rm -f "$testfile"
+		return
+	fi
+
+	if [[ "$second_rc" -eq 0 ]]; then
+		result "$num" "$name" FAIL "second trigger did not return error"
+		rm -f "$testfile"
+		return
+	fi
+
+	rm -f "$testfile" 2>/dev/null
+	result "$num" "$name" PASS
+}
+
+# --- Test 2: dirty_caps_at_reset ----------------------------------------=
----
+#
+# Write to a file without fsync (dirty caps), trigger reset, then
+# verify the file is not corrupt.  Manual reset drains dirty caps
+# before teardown (best-effort, 5s timeout).  For a non-stuck cap
+# the dirty write should be flushed during drain and persist.
+# If the drain window is too short, only the synced first line
+# persists -- that is acceptable (data loss is documented for
+# unflushed writes).
+
+test_dirty_caps_at_reset()
+{
+	local num=3D2
+	local name=3D"dirty_caps_at_reset"
+	local testfile=3D"$MOUNT_POINT/.reset_corner_dirty_caps_$$"
+	local content_after line_count sc_before sc_after le
+
+	sc_before=3D"$(read_status_field "success_count")"
+
+	echo "line_1_before_dirty_write" > "$testfile"
+	sync "$testfile"
+
+	python3 -c "
+import os, sys
+fd =3D os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'line_2_dirty_no_fsync\n')
+# deliberately no fsync -- leave caps dirty
+sys.stdout.write('written')
+" 2>/dev/null || {
+		result "$num" "$name" FAIL "dirty write failed"
+		rm -f "$testfile"
+		return
+	}
+
+	echo "dirty_caps_test" > "$TRIGGER_PATH" 2>/dev/null || {
+		result "$num" "$name" FAIL "reset trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	if ! wait_reset_done 30; then
+		result "$num" "$name" FAIL "reset did not complete"
+		rm -f "$testfile"
+		return
+	fi
+
+	sc_after=3D"$(read_status_field "success_count")"
+	if [[ "$sc_after" -le "$sc_before" ]]; then
+		result "$num" "$name" FAIL "success_count did not increment (reset not e=
xercised)"
+		rm -f "$testfile"
+		return
+	fi
+
+	sync "$testfile" 2>/dev/null || true
+	content_after=3D"$(cat "$testfile" 2>/dev/null)" || {
+		result "$num" "$name" FAIL "cannot read file after reset"
+		rm -f "$testfile"
+		return
+	}
+
+	if [[ -z "$content_after" ]]; then
+		result "$num" "$name" FAIL "file is empty after reset"
+		rm -f "$testfile"
+		return
+	fi
+
+	line_count=3D"$(echo "$content_after" | wc -l)"
+	if [[ "$line_count" -lt 1 ]]; then
+		result "$num" "$name" FAIL "file has $line_count lines, expected >=3D 1"
+		rm -f "$testfile"
+		return
+	fi
+
+	echo "$content_after" | head -1 | grep -q "line_1_before_dirty_write" || {
+		result "$num" "$name" FAIL "first line corrupted"
+		rm -f "$testfile"
+		return
+	}
+
+	le=3D"$(read_status_field "last_errno")"
+	if [[ "$le" !=3D "0" ]]; then
+		result "$num" "$name" FAIL "last_errno=3D$le, expected 0"
+		rm -f "$testfile"
+		return
+	fi
+
+	rm -f "$testfile"
+	result "$num" "$name" PASS "file intact ($line_count lines)"
+}
+
+# --- Test 3: flock_after_reset ------------------------------------------=
----
+#
+# Take an exclusive flock, trigger reset, verify stale lock state is
+# marked with CEPH_I_ERROR_FILELOCK (same-client flock attempt returns
+# EIO).  After the original holder exits (releasing the local lock
+# reference and clearing the error flag), a fresh lock can be acquired.
+#
+# The lock holder uses the fd-based flock form with exec, so killing
+# $lock_pid closes the lock fd immediately (no orphaned child with an
+# inherited fd copy that would prevent the VFS flock release).
+
+test_flock_after_reset()
+{
+	local num=3D3
+	local name=3D"flock_after_reset"
+	local testfile=3D"$MOUNT_POINT/.reset_corner_flock_$$"
+	local lock_pid probe_rc sc_before sc_after
+
+	sc_before=3D"$(read_status_field "success_count")"
+
+	echo "flock_test_content" > "$testfile"
+	sync "$testfile"
+
+	# Hold lock via fd in a subshell; exec ensures killing $lock_pid
+	# closes the lock fd directly (no fork/child fd inheritance).
+	(
+		exec 9<"$testfile"
+		flock --exclusive --nonblock 9 || exit 1
+		exec sleep 300
+	) &
+	lock_pid=3D$!
+	sleep 0.5
+
+	if ! kill -0 "$lock_pid" 2>/dev/null; then
+		result "$num" "$name" FAIL "flock holder died immediately"
+		rm -f "$testfile"
+		return
+	fi
+
+	echo "flock_after_reset_test" > "$TRIGGER_PATH" 2>/dev/null || {
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "reset trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	if ! wait_reset_done 30; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "reset did not complete"
+		rm -f "$testfile"
+		return
+	fi
+
+	sc_after=3D"$(read_status_field "success_count")"
+	if [[ "$sc_after" -le "$sc_before" ]]; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "success_count did not increment"
+		rm -f "$testfile"
+		return
+	fi
+
+	# After teardown, CEPH_I_ERROR_FILELOCK is set on the inode.
+	# A same-client lock attempt should fail (EIO), NOT succeed.
+	probe_rc=3D0
+	flock --exclusive --nonblock "$testfile" true 2>/dev/null && probe_rc=3D0=
 || probe_rc=3D$?
+	if [[ "$probe_rc" -eq 0 ]]; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL \
+			"same-client probe succeeded, expected EIO from stale lock state"
+		rm -f "$testfile"
+		return
+	fi
+
+	# Kill the holder -- the exec'd sleep IS $lock_pid, so killing it
+	# closes fd 9 directly.  VFS flock release fires ceph_fl_release_lock(),
+	# which decrements i_filelock_ref to 0 and clears CEPH_I_ERROR_FILELOCK.
+	kill "$lock_pid" 2>/dev/null
+	wait "$lock_pid" 2>/dev/null
+
+	# After the holder exits, a fresh lock should be acquirable.
+	# The reset teardown sends SESSION_REQUEST_CLOSE so the MDS
+	# releases locks promptly, but retry briefly in case the
+	# message races with the connection close.
+	local attempt
+	probe_rc=3D1
+	for attempt in 1 2 3 4 5; do
+		probe_rc=3D0
+		flock --exclusive --nonblock "$testfile" true 2>/dev/null \
+			&& probe_rc=3D0 || probe_rc=3D$?
+		[[ "$probe_rc" -eq 0 ]] && break
+		sleep 1
+	done
+	if [[ "$probe_rc" -ne 0 ]]; then
+		result "$num" "$name" FAIL \
+			"cannot acquire fresh lock after holder exit (rc=3D$probe_rc, ${attempt=
} attempts)"
+		rm -f "$testfile"
+		return
+	fi
+
+	# Verify file content survived
+	grep -q "flock_test_content" "$testfile" 2>/dev/null || {
+		result "$num" "$name" FAIL "file content corrupted after reset"
+		rm -f "$testfile"
+		return
+	}
+
+	rm -f "$testfile"
+	result "$num" "$name" PASS "stale lock detected, fresh lock acquired afte=
r holder exit"
+}
+
+# --- Test 4: unmount_during_reset ---------------------------------------=
----
+#
+# Mount a fresh CephFS, trigger reset, immediately unmount. The
+# ceph_mdsc_destroy() path must wake blocked waiters with -ESHUTDOWN
+# and not hang.
+
+test_unmount_during_reset()
+{
+	local num=3D4
+	local name=3D"unmount_during_reset"
+	local temp_mnt=3D"/tmp/ceph_corner_mnt_$$"
+	local mount_opts=3D""
+	local mount_src=3D""
+	local temp_trigger=3D""
+	local temp_status=3D""
+	local temp_client=3D""
+	local temp_file=3D"$temp_mnt/.reset_corner_umount_$$"
+	local phase=3D""
+	local trigger_ok=3D0
+	local attempt
+	local -a new_clients=3D()
+	declare -A existing_clients=3D()
+
+	mount_src=3D"$(awk -v mp=3D"$MOUNT_POINT" '$2 =3D=3D mp && $3 =3D=3D "cep=
h" {print $1; exit}' /proc/mounts 2>/dev/null)"
+	mount_opts=3D"$(awk -v mp=3D"$MOUNT_POINT" '$2 =3D=3D mp && $3 =3D=3D "ce=
ph" {print $4; exit}' /proc/mounts 2>/dev/null)"
+
+	if [[ -z "$mount_src" ]]; then
+		result "$num" "$name" SKIP "cannot determine mount source from /proc/mou=
nts"
+		return
+	fi
+
+	while IFS=3D read -r existing; do
+		[[ -n "$existing" ]] || continue
+		existing_clients["$existing"]=3D1
+	done < <(list_reset_clients)
+
+	mkdir -p "$temp_mnt"
+
+	if ! mount -t ceph "$mount_src" "$temp_mnt" -o "$mount_opts" 2>/dev/null;=
 then
+		result "$num" "$name" SKIP "cannot mount additional CephFS instance"
+		rmdir "$temp_mnt" 2>/dev/null
+		return
+	fi
+
+	ls "$temp_mnt" > /dev/null 2>&1
+	sync
+	sleep 1
+
+	for attempt in $(seq 1 50); do
+		new_clients=3D()
+		while IFS=3D read -r entry; do
+			[[ -n "$entry" ]] || continue
+			if [[ -n "${existing_clients[$entry]+x}" ]]; then
+				continue
+			fi
+			new_clients+=3D("$entry")
+		done < <(list_reset_clients)
+
+		if [[ "${#new_clients[@]}" -eq 1 ]]; then
+			temp_client=3D"${new_clients[0]}"
+			break
+		fi
+
+		if [[ "${#new_clients[@]}" -gt 1 ]]; then
+			break
+		fi
+
+		sleep 0.1
+	done
+
+	if [[ "${#new_clients[@]}" -gt 1 ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" SKIP "multiple new debugfs clients appeared"
+		return
+	fi
+
+	if [[ -z "$temp_client" ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" SKIP "cannot identify debugfs client for temp moun=
t"
+		return
+	fi
+
+	temp_trigger=3D"$DEBUGFS_ROOT/$temp_client/reset/trigger"
+	temp_status=3D"$DEBUGFS_ROOT/$temp_client/reset/status"
+
+	echo "umount_dirty_seed" > "$temp_file" 2>/dev/null || {
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot create dirty state on temp mount"
+		return
+	}
+	sync "$temp_file"
+	python3 -c "
+import os, sys
+fd =3D os.open('$temp_file', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_umount_test\\n')
+os.close(fd)
+" 2>/dev/null || {
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot dirty temp mount for reset overlap"
+		return
+	}
+
+	echo "unmount_test" > "$temp_trigger" 2>/dev/null && trigger_ok=3D1 || tr=
igger_ok=3D0
+	if [[ "$trigger_ok" -ne 1 ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot trigger reset on temp mount"
+		return
+	fi
+
+	if ! wait_status_nonidle "$temp_status" 10; then
+		phase=3D"$(awk -F': ' '$1 =3D=3D "phase" {print $2}' "$temp_status" 2>/d=
ev/null)"
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL \
+			"reset never became active before umount (phase=3D${phase:-unknown})"
+		return
+	fi
+
+	local umount_ok=3D0
+	timeout 30 umount "$temp_mnt" 2>/dev/null && umount_ok=3D1
+
+	if [[ "$umount_ok" -ne 1 ]]; then
+		umount -l "$temp_mnt" 2>/dev/null || true
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "umount hung for >30s"
+		return
+	fi
+
+	rmdir "$temp_mnt" 2>/dev/null
+
+	ls "$MOUNT_POINT" > /dev/null 2>&1 || {
+		result "$num" "$name" FAIL "original mount unhealthy after test"
+		return
+	}
+
+	result "$num" "$name" PASS
+}
+
+# --- Main ---------------------------------------------------------------=
-----
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point <path> [--client-id <id>] [--debugfs-root <path>]
+
+Runs targeted corner-case tests for the CephFS client reset feature.
+Requires root (debugfs access) and a mounted CephFS filesystem.
+
+Options:
+  --mount-point PATH     CephFS mount point (required)
+  --client-id ID         Ceph debugfs client id (auto-detect if one client)
+  --debugfs-root PATH    Debugfs ceph root (default: /sys/kernel/debug/cep=
h)
+  --help                 Show this message
+EOF
+}
+
+main()
+{
+	while [[ $# -gt 0 ]]; do
+		case "$1" in
+		--mount-point)   MOUNT_POINT=3D"$2"; shift 2 ;;
+		--client-id)     DEBUGFS_CLIENT=3D"$2"; shift 2 ;;
+		--debugfs-root)  DEBUGFS_ROOT=3D"$2"; shift 2 ;;
+		--help|-h)       usage; exit 0 ;;
+		*)               echo "Unknown option: $1" >&2; usage; exit 2 ;;
+		esac
+	done
+
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "--mount-point is required" >&2
+		usage
+		exit 2
+	fi
+
+	if [[ ! -d "$MOUNT_POINT" ]]; then
+		echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	discover_debugfs
+	TRIGGER_PATH=3D"$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/trigger"
+	STATUS_PATH=3D"$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/status"
+
+	log "CephFS client reset corner case tests"
+	log "Mount: $MOUNT_POINT"
+	log "Client: $DEBUGFS_CLIENT"
+	echo ""
+
+	test_ebusy_rejection
+	test_dirty_caps_at_reset
+	test_flock_after_reset
+	test_unmount_during_reset
+
+	echo ""
+	echo "Results: $PASS_COUNT passed, $FAIL_COUNT failed, $SKIP_COUNT skippe=
d (of $TOTAL)"
+
+	if [[ "$FAIL_COUNT" -gt 0 ]]; then
+		exit 1
+	fi
+	exit 0
+}
+
+main "$@"
diff --git a/tools/testing/selftests/filesystems/ceph/reset_stress.sh b/too=
ls/testing/selftests/filesystems/ceph/reset_stress.sh
new file mode 100755
index 000000000000..c503c75a5f7a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_stress.sh
@@ -0,0 +1,694 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS reset stress test:
+# - Runs concurrent I/O and rename workloads
+# - Triggers random client resets through debugfs
+# - Validates consistency and recovery behavior
+
+set -euo pipefail
+
+KSFT_SKIP=3D4
+SCRIPT_DIR=3D"$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+	MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)" |=
| true
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "SKIP: No CephFS mount found and --mount-point not specified"
+		exit "$KSFT_SKIP"
+	fi
+	exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+PROFILE=3D"moderate"
+DURATION_SEC=3D""
+COOLDOWN_SEC=3D20
+FILE_COUNT=3D64
+IO_WORKERS=3D""
+RENAME_WORKERS=3D""
+MOUNT_POINT=3D""
+OUT_DIR=3D""
+CLIENT_ID=3D""
+DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph"
+SLO_SECONDS=3D30
+EXPECT_RESET=3D1
+DMESG_CMD=3D""
+SUDO=3D""
+
+RESET_MIN_SEC=3D5
+RESET_MAX_SEC=3D15
+
+RUN_ID=3D"$(date +%Y%m%d-%H%M%S)"
+WORKLOAD_FLAG=3D""
+RESET_FLAG=3D""
+DATA_DIR=3D""
+
+IO_LOG=3D""
+RENAME_LOG=3D""
+RESET_LOG=3D""
+STATUS_LOG=3D""
+STATUS_BEFORE=3D""
+STATUS_FINAL=3D""
+DMESG_LOG=3D""
+SUMMARY_LOG=3D""
+REPORT_JSON=3D""
+
+RESET_PID=3D0
+STATUS_PID=3D0
+declare -a IO_WORKER_PIDS=3D()
+declare -a RENAME_WORKER_PIDS=3D()
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point <cephfs_mount> [options]
+
+Required:
+  --mount-point PATH       CephFS mount point to test under
+
+Options:
+  --profile NAME           baseline|moderate|aggressive|soak (default: mod=
erate)
+  --duration-sec N         Override profile runtime in seconds
+  --cooldown-sec N         Workload drain time after injector stop (defaul=
t: 20)
+  --file-count N           Number of logical files (default: 64)
+  --io-workers N           Number of concurrent I/O workers (profile defau=
lt)
+  --rename-workers N       Number of concurrent rename workers (profile de=
fault)
+  --out-dir PATH           Artifact directory (default: /tmp/ceph_reset_st=
ress_<ts>)
+  --client-id ID           Ceph debugfs client id; auto-detect if one clie=
nt exists
+  --debugfs-root PATH      Debugfs Ceph root (default: /sys/kernel/debug/c=
eph)
+  --slo-seconds N          Max allowed post-reset stall window (default: 3=
0)
+  --no-reset               Disable reset injector (baseline mode helper)
+  --help                   Show this message
+
+Examples:
+  $0 --mount-point /mnt/cephfs --profile moderate
+  $0 --mount-point /mnt/cephfs --profile aggressive --duration-sec 300
+  $0 --mount-point /mnt/cephfs --profile baseline --no-reset
+EOF
+}
+
+now_ms()
+{
+	date +%s%3N
+}
+
+set_profile_defaults()
+{
+	case "$PROFILE" in
+	baseline)
+		RESET_MIN_SEC=3D0
+		RESET_MAX_SEC=3D0
+		EXPECT_RESET=3D0
+		: "${DURATION_SEC:=3D600}"
+		: "${IO_WORKERS:=3D1}"
+		: "${RENAME_WORKERS:=3D1}"
+		;;
+	moderate)
+		RESET_MIN_SEC=3D5
+		RESET_MAX_SEC=3D15
+		: "${DURATION_SEC:=3D900}"
+		: "${IO_WORKERS:=3D2}"
+		: "${RENAME_WORKERS:=3D1}"
+		;;
+	aggressive)
+		RESET_MIN_SEC=3D1
+		RESET_MAX_SEC=3D5
+		: "${DURATION_SEC:=3D900}"
+		: "${IO_WORKERS:=3D4}"
+		: "${RENAME_WORKERS:=3D2}"
+		;;
+	soak)
+		RESET_MIN_SEC=3D5
+		RESET_MAX_SEC=3D15
+		: "${DURATION_SEC:=3D3600}"
+		: "${IO_WORKERS:=3D2}"
+		: "${RENAME_WORKERS:=3D1}"
+		;;
+	*)
+		echo "Unknown profile: $PROFILE" >&2
+		exit 2
+		;;
+	esac
+}
+
+log_summary()
+{
+	local msg=3D"$1"
+	printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$msg" | tee -a "$SUM=
MARY_LOG"
+}
+
+discover_client_id()
+{
+	local candidates=3D()
+	local entry
+
+	if [[ -n "$CLIENT_ID" ]]; then
+		if ! $SUDO test -d "$DEBUGFS_ROOT/$CLIENT_ID/reset"; then
+			echo "SKIP: reset debugfs not found for client-id=3D$CLIENT_ID" >&2
+			exit "$KSFT_SKIP"
+		fi
+		return 0
+	fi
+
+	if ! $SUDO test -d "$DEBUGFS_ROOT"; then
+		echo "SKIP: Debugfs root not found: $DEBUGFS_ROOT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	while IFS=3D read -r entry; do
+		$SUDO test -d "$DEBUGFS_ROOT/$entry/reset" || continue
+		$SUDO test -w "$DEBUGFS_ROOT/$entry/reset/trigger" || continue
+		candidates+=3D("$entry")
+	done < <($SUDO ls -1 "$DEBUGFS_ROOT" 2>/dev/null || true)
+
+	if [[ ${#candidates[@]} -eq 1 ]]; then
+		CLIENT_ID=3D"${candidates[0]}"
+		return 0
+	fi
+
+	if [[ ${#candidates[@]} -eq 0 ]]; then
+		echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" =
>&2
+		exit "$KSFT_SKIP"
+	fi
+
+	echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-=
id." >&2
+	exit "$KSFT_SKIP"
+}
+
+init_dataset()
+{
+	local i
+	mkdir -p "$DATA_DIR/A" "$DATA_DIR/B"
+
+	for ((i =3D 0; i < FILE_COUNT; i++)); do
+		printf 'seed logical_id=3D%05d ts_ms=3D%s\n' "$i" "$(now_ms)" > "$DATA_D=
IR/A/file_$(printf '%05d' "$i")"
+	done
+}
+
+io_worker()
+{
+	set +e
+	local worker_id=3D"$1"
+	local seq=3D0
+	local id
+	local relpath
+	local abspath
+	local payload
+	local hash
+	local ts alt_relpath alt_abspath result actual_abspath
+
+	while [[ -f "$WORKLOAD_FLAG" ]]; do
+		id=3D"$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+		if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+			relpath=3D"A/file_$id"
+		elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+			relpath=3D"B/file_$id"
+		else
+			sleep 0.02
+			continue
+		fi
+
+		abspath=3D"$DATA_DIR/$relpath"
+		alt_relpath=3D""
+		if [[ "$relpath" =3D=3D A/* ]]; then
+			alt_relpath=3D"B/file_$id"
+		else
+			alt_relpath=3D"A/file_$id"
+		fi
+		alt_abspath=3D"$DATA_DIR/$alt_relpath"
+		payload=3D"worker=3D${worker_id} io_seq=3D${seq} id=3D${id} ts_ms=3D$(no=
w_ms)"
+		result=3D"$(
+			python3 - "$abspath" "$alt_abspath" "$payload" <<'PY'
+import hashlib
+import os
+import sys
+
+path =3D sys.argv[1]
+alt_path =3D sys.argv[2]
+payload =3D sys.argv[3]
+
+try:
+    fd =3D os.open(path, os.O_RDWR | os.O_APPEND)
+    actual =3D path
+except FileNotFoundError:
+    try:
+        fd =3D os.open(alt_path, os.O_RDWR | os.O_APPEND)
+        actual =3D alt_path
+    except FileNotFoundError:
+        sys.exit(1)
+
+try:
+    os.write(fd, (payload + "\n").encode())
+    os.fsync(fd)
+    os.lseek(fd, 0, os.SEEK_SET)
+    digest =3D hashlib.sha256()
+    while True:
+        chunk =3D os.read(fd, 1 << 20)
+        if not chunk:
+            break
+        digest.update(chunk)
+    print(actual + " " + digest.hexdigest())
+finally:
+    os.close(fd)
+PY
+		)" || {
+			sleep 0.02
+			continue
+		}
+
+		actual_abspath=3D"${result%% *}"
+		hash=3D"${result#* }"
+		if [[ "$actual_abspath" =3D=3D "$alt_abspath" ]]; then
+			relpath=3D"$alt_relpath"
+		fi
+
+		ts=3D"$(now_ms)"
+		printf '%s,%s,%s,%s,%s\n' "$ts" "$seq" "$id" "$relpath" "$hash" >> "$IO_=
LOG"
+		seq=3D$((seq + 1))
+		sleep 0.02
+	done
+}
+
+rename_worker()
+{
+	set +e
+	local worker_id=3D"$1"
+	local seq=3D0
+	local id
+	local src_rel
+	local dst_rel
+	local rc
+	local ts
+
+	while [[ -f "$WORKLOAD_FLAG" ]]; do
+		id=3D"$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+
+		if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+			src_rel=3D"A/file_$id"
+			dst_rel=3D"B/file_$id"
+		elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+			src_rel=3D"B/file_$id"
+			dst_rel=3D"A/file_$id"
+		else
+			sleep 0.02
+			continue
+		fi
+
+		ts=3D"$(now_ms)"
+		if mv -T "$DATA_DIR/$src_rel" "$DATA_DIR/$dst_rel" 2>/dev/null; then
+			rc=3D0
+		else
+			rc=3D$?
+		fi
+		printf '%s,%s,%s,%s,%s,%s,%s\n' "$ts" "$worker_id" "$seq" "$id" "$src_re=
l" "$dst_rel" "$rc" >> "$RENAME_LOG"
+		seq=3D$((seq + 1))
+		sleep 0.02
+	done
+}
+
+random_sleep_seconds()
+{
+	local min_sec=3D"$1"
+	local max_sec=3D"$2"
+	local wait_sec
+	local span
+
+	span=3D$((max_sec - min_sec + 1))
+	wait_sec=3D$((min_sec + RANDOM % span))
+	sleep "$wait_sec"
+}
+
+reset_injector()
+{
+	set +e
+	local trigger_path=3D"$1"
+	local seq=3D0
+	local ts
+	local reason
+	local rc
+
+	while [[ -f "$RESET_FLAG" ]]; do
+		random_sleep_seconds "$RESET_MIN_SEC" "$RESET_MAX_SEC"
+		[[ -f "$RESET_FLAG" ]] || break
+
+		ts=3D"$(now_ms)"
+		reason=3D"stress_${seq}_${ts}"
+		if echo "$reason" | $SUDO tee "$trigger_path" > /dev/null 2>&1; then
+			rc=3D0
+		else
+			rc=3D$?
+		fi
+		printf '%s,%s,%s,%s\n' "$ts" "$seq" "$reason" "$rc" >> "$RESET_LOG"
+		seq=3D$((seq + 1))
+	done
+}
+
+status_sampler()
+{
+	set +e
+	local status_path=3D"$1"
+	local ts
+	local kv_line
+
+	while [[ -f "$WORKLOAD_FLAG" || -f "$RESET_FLAG" ]]; do
+		ts=3D"$(now_ms)"
+		if $SUDO test -r "$status_path"; then
+			kv_line=3D"$($SUDO awk -F': ' 'NF>=3D2 {gsub(/ /, "", $1); gsub(/ /, ""=
, $2); printf "%s=3D%s;", $1, $2}' "$status_path")"
+			printf '%s,%s\n' "$ts" "$kv_line" >> "$STATUS_LOG"
+		fi
+		sleep 1
+	done
+}
+
+stop_pid_with_timeout()
+{
+	local pid=3D"$1"
+	local name=3D"$2"
+	local timeout=3D"$3"
+	local waited=3D0
+
+	if [[ "$pid" -le 0 ]]; then
+		return 0
+	fi
+
+	while kill -0 "$pid" 2>/dev/null; do
+		if (( waited >=3D timeout )); then
+			log_summary "Timeout waiting for $name (pid=3D$pid), sending SIGTERM/SI=
GKILL"
+			kill -TERM "$pid" 2>/dev/null || true
+			sleep 1
+			kill -KILL "$pid" 2>/dev/null || true
+			wait "$pid" 2>/dev/null || true
+			return 1
+		fi
+		sleep 1
+		waited=3D$((waited + 1))
+	done
+
+	wait "$pid" 2>/dev/null || true
+	return 0
+}
+
+detect_privileges()
+{
+	if [[ -r "$DEBUGFS_ROOT" ]]; then
+		SUDO=3D""
+	elif sudo -n true 2>/dev/null; then
+		SUDO=3D"sudo"
+	else
+		echo "WARNING: $DEBUGFS_ROOT is not readable and passwordless sudo is no=
t available" >&2
+		echo "WARNING: reset injection, debugfs status checks, and dmesg capture=
 will not work" >&2
+	fi
+
+	if $SUDO dmesg > /dev/null 2>&1; then
+		DMESG_CMD=3D"$SUDO dmesg"
+	else
+		DMESG_CMD=3D""
+		echo "WARNING: dmesg is not accessible; kernel errors (hung tasks) will =
not be detected" >&2
+	fi
+}
+
+check_dmesg()
+{
+	local start_epoch=3D"$1"
+
+	if [[ -z "$DMESG_CMD" ]]; then
+		return 0
+	fi
+
+	if ! $DMESG_CMD --since "@$start_epoch" > "$DMESG_LOG" 2>/dev/null; then
+		if ! $DMESG_CMD > "$DMESG_LOG" 2>/dev/null; then
+			log_summary "WARNING: dmesg capture failed unexpectedly"
+			return 0
+		fi
+		log_summary "dmesg --since unsupported; captured full dmesg"
+	fi
+
+	if grep -qiE "hung[ _]task" "$DMESG_LOG" 2>/dev/null; then
+		log_summary "ERROR: kernel log contains 'hung task' during test window"
+		return 1
+	fi
+
+	return 0
+}
+
+cleanup()
+{
+	rm -f "$WORKLOAD_FLAG" "$RESET_FLAG"
+	local pid
+	for pid in "${IO_WORKER_PIDS[@]}" "${RENAME_WORKER_PIDS[@]}" "$RESET_PID"=
 "$STATUS_PID"; do
+		[[ "$pid" -gt 0 ]] 2>/dev/null && kill "$pid" 2>/dev/null || true
+	done
+	wait 2>/dev/null || true
+}
+
+parse_args()
+{
+	while [[ $# -gt 0 ]]; do
+		case "$1" in
+		--mount-point)
+			MOUNT_POINT=3D"$2"
+			shift 2
+			;;
+		--profile)
+			PROFILE=3D"$2"
+			shift 2
+			;;
+		--duration-sec)
+			DURATION_SEC=3D"$2"
+			shift 2
+			;;
+		--cooldown-sec)
+			COOLDOWN_SEC=3D"$2"
+			shift 2
+			;;
+		--file-count)
+			FILE_COUNT=3D"$2"
+			shift 2
+			;;
+		--io-workers)
+			IO_WORKERS=3D"$2"
+			shift 2
+			;;
+		--rename-workers)
+			RENAME_WORKERS=3D"$2"
+			shift 2
+			;;
+		--out-dir)
+			OUT_DIR=3D"$2"
+			shift 2
+			;;
+		--client-id)
+			CLIENT_ID=3D"$2"
+			shift 2
+			;;
+		--debugfs-root)
+			DEBUGFS_ROOT=3D"$2"
+			shift 2
+			;;
+		--slo-seconds)
+			SLO_SECONDS=3D"$2"
+			shift 2
+			;;
+		--no-reset)
+			EXPECT_RESET=3D0
+			shift
+			;;
+		--help|-h)
+			usage
+			exit 0
+			;;
+		*)
+			echo "Unknown option: $1" >&2
+			usage
+			exit 2
+			;;
+		esac
+	done
+}
+
+main()
+{
+	local start_epoch
+	local trigger_path=3D""
+	local status_path=3D""
+	local final_rc=3D0
+	local reset_enabled=3D0
+	local i
+
+	parse_args "$@"
+
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "--mount-point is required" >&2
+		usage
+		exit 2
+	fi
+
+	if [[ ! -d "$MOUNT_POINT" ]]; then
+		echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	if ! touch "$MOUNT_POINT/.ceph_reset_test_probe" 2>/dev/null; then
+		echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+		exit "$KSFT_SKIP"
+	fi
+	rm -f "$MOUNT_POINT/.ceph_reset_test_probe"
+
+	if ! command -v python3 > /dev/null 2>&1; then
+		echo "SKIP: python3 is required but not found in PATH" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	if ! stat -f -c '%T' "$MOUNT_POINT" 2>/dev/null | grep -qi ceph; then
+		echo "WARNING: $MOUNT_POINT does not appear to be a CephFS mount" >&2
+	fi
+
+	detect_privileges
+
+	set_profile_defaults
+	if [[ "$EXPECT_RESET" -eq 0 ]]; then
+		PROFILE=3D"baseline"
+		RESET_MIN_SEC=3D0
+		RESET_MAX_SEC=3D0
+	fi
+
+	if ! [[ "$IO_WORKERS" =3D~ ^[0-9]+$ && "$RENAME_WORKERS" =3D~ ^[0-9]+$ ]]=
; then
+		echo "io-workers and rename-workers must be integers" >&2
+		exit 2
+	fi
+
+	if [[ "$IO_WORKERS" -le 0 || "$RENAME_WORKERS" -le 0 ]]; then
+		echo "io-workers and rename-workers must be > 0" >&2
+		exit 2
+	fi
+
+	if [[ -z "$OUT_DIR" ]]; then
+		OUT_DIR=3D"/tmp/ceph_reset_stress_${RUN_ID}"
+	fi
+	mkdir -p "$OUT_DIR"
+
+	WORKLOAD_FLAG=3D"$OUT_DIR/workload.running"
+	RESET_FLAG=3D"$OUT_DIR/reset.running"
+
+	DATA_DIR=3D"$MOUNT_POINT/ceph_reset_stress_${RUN_ID}"
+	mkdir -p "$DATA_DIR"
+
+	IO_LOG=3D"$OUT_DIR/io.log"
+	RENAME_LOG=3D"$OUT_DIR/rename.log"
+	RESET_LOG=3D"$OUT_DIR/reset.log"
+	STATUS_LOG=3D"$OUT_DIR/status.log"
+	STATUS_BEFORE=3D"$OUT_DIR/reset_status.before"
+	STATUS_FINAL=3D"$OUT_DIR/reset_status.final"
+	DMESG_LOG=3D"$OUT_DIR/dmesg.log"
+	SUMMARY_LOG=3D"$OUT_DIR/summary.log"
+	REPORT_JSON=3D"$OUT_DIR/validator_report.json"
+
+	: > "$IO_LOG"
+	: > "$RENAME_LOG"
+	: > "$RESET_LOG"
+	: > "$STATUS_LOG"
+	: > "$SUMMARY_LOG"
+
+	start_epoch=3D"$(date +%s)"
+
+	log_summary "Starting Ceph reset stress test"
+	log_summary "Profile=3D$PROFILE duration=3D${DURATION_SEC}s cooldown=3D${=
COOLDOWN_SEC}s file_count=3D${FILE_COUNT} io_workers=3D${IO_WORKERS} rename=
_workers=3D${RENAME_WORKERS}"
+	[[ -n "$SUDO" ]] && log_summary "Using sudo for privileged operations"
+	[[ -z "$DMESG_CMD" ]] && log_summary "WARNING: dmesg not available; hung =
task detection disabled"
+	log_summary "Artifacts=3D$OUT_DIR"
+	log_summary "Data dir=3D$DATA_DIR"
+
+	init_dataset
+
+	if [[ "$EXPECT_RESET" -eq 1 ]]; then
+		discover_client_id
+		trigger_path=3D"$DEBUGFS_ROOT/$CLIENT_ID/reset/trigger"
+		status_path=3D"$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+		if ! $SUDO test -w "$trigger_path"; then
+			echo "SKIP: Reset trigger is not writable: $trigger_path" >&2
+			exit "$KSFT_SKIP"
+		fi
+		if ! $SUDO test -r "$status_path"; then
+			echo "SKIP: Reset status is not readable: $status_path" >&2
+			exit "$KSFT_SKIP"
+		fi
+		$SUDO cat "$status_path" > "$STATUS_BEFORE" || true
+		reset_enabled=3D1
+		log_summary "Using ceph client id: $CLIENT_ID"
+	fi
+
+	trap cleanup EXIT INT TERM
+
+	touch "$WORKLOAD_FLAG"
+	for ((i =3D 0; i < IO_WORKERS; i++)); do
+		io_worker "$i" &
+		IO_WORKER_PIDS+=3D("$!")
+	done
+
+	for ((i =3D 0; i < RENAME_WORKERS; i++)); do
+		rename_worker "$i" &
+		RENAME_WORKER_PIDS+=3D("$!")
+	done
+
+	if [[ "$reset_enabled" -eq 1 ]]; then
+		touch "$RESET_FLAG"
+		reset_injector "$trigger_path" &
+		RESET_PID=3D$!
+
+		status_sampler "$status_path" &
+		STATUS_PID=3D$!
+	fi
+
+	sleep "$DURATION_SEC"
+
+	if [[ "$reset_enabled" -eq 1 ]]; then
+		rm -f "$RESET_FLAG"
+		stop_pid_with_timeout "$RESET_PID" "reset_injector" 20 || final_rc=3D1
+		log_summary "Injector stopped; entering cooldown=3D${COOLDOWN_SEC}s"
+	fi
+
+	sleep "$COOLDOWN_SEC"
+
+	rm -f "$WORKLOAD_FLAG"
+	for i in "${!IO_WORKER_PIDS[@]}"; do
+		stop_pid_with_timeout "${IO_WORKER_PIDS[$i]}" "io_worker[$i]" 20 || fina=
l_rc=3D1
+	done
+	for i in "${!RENAME_WORKER_PIDS[@]}"; do
+		stop_pid_with_timeout "${RENAME_WORKER_PIDS[$i]}" "rename_worker[$i]" 20=
 || final_rc=3D1
+	done
+
+	if [[ "$reset_enabled" -eq 1 ]]; then
+		stop_pid_with_timeout "$STATUS_PID" "status_sampler" 10 || final_rc=3D1
+		$SUDO cat "$status_path" > "$STATUS_FINAL" || true
+	fi
+
+	if ! check_dmesg "$start_epoch"; then
+		final_rc=3D1
+	fi
+
+	if ! python3 "$SCRIPT_DIR/validate_consistency.py" \
+		--data-dir "$DATA_DIR" \
+		--file-count "$FILE_COUNT" \
+		--io-log "$IO_LOG" \
+		--rename-log "$RENAME_LOG" \
+		--reset-log "$RESET_LOG" \
+		--status-final "$STATUS_FINAL" \
+		--slo-seconds "$SLO_SECONDS" \
+		--report-json "$REPORT_JSON" \
+		$( [[ "$reset_enabled" -eq 1 ]] && echo "--expect-reset" ); then
+		final_rc=3D1
+	fi
+
+	if [[ "$final_rc" -eq 0 ]]; then
+		log_summary "PASS: stress run completed successfully"
+	else
+		log_summary "FAIL: stress run detected one or more failures"
+	fi
+
+	log_summary "Artifacts available in: $OUT_DIR"
+	exit "$final_rc"
+}
+
+main "$@"
diff --git a/tools/testing/selftests/filesystems/ceph/run_validation.sh b/t=
ools/testing/selftests/filesystems/ceph/run_validation.sh
new file mode 100755
index 000000000000..5d521e4f9e9b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/run_validation.sh
@@ -0,0 +1,350 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset - single-command validation.
+# Runs all test stages in sequence with per-stage timeouts.
+# If any stage hangs (filesystem stuck, process blocked), the
+# timeout kills it and reports failure.
+#
+# Usage:
+#   sudo ./run_validation.sh --mount-point /mnt/mycephfs
+#
+# Expected output on success:
+#
+#   =3D=3D=3D CephFS Client Reset Validation =3D=3D=3D
+#   [stage 1/5] baseline         PASS  (60s, no resets)
+#   [stage 2/5] corner_cases     PASS  (4/4 passed)
+#   [stage 3/5] moderate         PASS  (120s, resets every 5-15s)
+#   [stage 4/5] aggressive       PASS  (120s, resets every 1-5s)
+#   [stage 5/5] status_check     PASS  (phase=3Didle, last_errno=3D0)
+#
+#   RESULT: 5/5 stages passed
+#   Artifacts: /tmp/ceph_reset_validation_<timestamp>
+
+set -uo pipefail
+
+KSFT_SKIP=3D4
+SCRIPT_DIR=3D"$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+	MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "SKIP: No CephFS mount found and --mount-point not specified"
+		exit "$KSFT_SKIP"
+	fi
+	exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=3D""
+CLIENT_ID=3D""
+declare -a CLIENT_ARGS=3D()
+declare -a DEBUGFS_ARGS=3D()
+RUN_ID=3D"$(date +%Y%m%d-%H%M%S)"
+OUT_DIR=3D"/tmp/ceph_reset_validation_${RUN_ID}"
+DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph"
+
+# Timeout margins: stage runtime + cooldown + validation + safety buffer
+STAGE1_TIMEOUT=3D120    # 60s run + 20s cooldown + 40s buffer
+STAGE2_TIMEOUT=3D300    # 4 corner cases, 30s each worst case + buffer
+STAGE3_TIMEOUT=3D240    # 120s run + 20s cooldown + 100s buffer
+STAGE4_TIMEOUT=3D240    # 120s run + 20s cooldown + 100s buffer
+STAGE5_TIMEOUT=3D10     # just reading debugfs
+
+PASS=3D0
+FAIL=3D0
+TOTAL=3D5
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point <cephfs_mount> [options]
+
+Required:
+  --mount-point PATH    CephFS mount point
+
+Options:
+  --out-dir PATH        Artifact directory (default: /tmp/ceph_reset_valid=
ation_<ts>)
+  --client-id ID        Ceph debugfs client id (optional)
+  --debugfs-root PATH   Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+  --help                Show this message
+EOF
+}
+
+stage_result()
+{
+	local num=3D"$1"
+	local name=3D"$2"
+	local status=3D"$3"
+	local detail=3D"$4"
+
+	if [[ "$status" =3D=3D "PASS" ]]; then
+		PASS=3D$((PASS + 1))
+	else
+		FAIL=3D$((FAIL + 1))
+	fi
+	printf '[stage %d/%d] %-16s %s  (%s)\n' "$num" "$TOTAL" "$name" "$status"=
 "$detail"
+}
+
+# Run a command with a timeout. Returns 0 on success, the command's exit
+# status on failure, or 1 on timeout (which also sets RUN_TIMED_OUT=3D1).
+#
+# The stage command runs in its own session/process group (via setsid).
+# On timeout the entire process group is killed, not just the top-level
+# script PID.  This is required because stage scripts (reset_stress.sh,
+# reset_corner_cases.sh) spawn child processes - I/O workers, rename
+# workers, reset injectors, samplers - that would otherwise survive the
+# timeout and bleed into later stages, invalidating results.
+RUN_TIMED_OUT=3D0
+
+run_with_timeout()
+{
+	local timeout_sec=3D"$1"
+	local logfile=3D"$2"
+	shift 2
+
+	RUN_TIMED_OUT=3D0
+
+	# Start the stage in its own session via setsid so all descendant
+	# processes share a process group that we can kill atomically.
+	# In a non-interactive script, background children are not process
+	# group leaders, so setsid(1) calls setsid(2) directly (no extra
+	# fork) and the PID we capture IS the group leader.
+	setsid "$@" > "$logfile" 2>&1 &
+	local pid=3D$!
+
+	# Watchdog: on timeout, kill the entire process group
+	(
+		sleep "$timeout_sec"
+		if kill -0 "$pid" 2>/dev/null; then
+			echo "TIMEOUT: stage exceeded ${timeout_sec}s, killing process group $p=
id" >> "$logfile"
+			kill -TERM -- -"$pid" 2>/dev/null
+			sleep 2
+			kill -KILL -- -"$pid" 2>/dev/null
+		fi
+	) >/dev/null 2>&1 &
+	local watchdog_pid=3D$!
+
+	# Wait for the stage command
+	wait "$pid" 2>/dev/null
+	local rc=3D$?
+
+	# Kill the watchdog if it's still running
+	kill "$watchdog_pid" 2>/dev/null
+	wait "$watchdog_pid" 2>/dev/null
+
+	# Check if it was killed by timeout
+	if grep -q "^TIMEOUT:" "$logfile" 2>/dev/null; then
+		RUN_TIMED_OUT=3D1
+		return 1
+	fi
+
+	return "$rc"
+}
+
+find_status_path()
+{
+	local entry
+
+	if [[ -n "$CLIENT_ID" ]]; then
+		if [[ -r "$DEBUGFS_ROOT/$CLIENT_ID/reset/status" ]]; then
+			echo "$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+			return 0
+		fi
+		return 1
+	fi
+
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		if [[ -r "${entry}reset/status" ]]; then
+			echo "${entry}reset/status"
+			return 0
+		fi
+	done
+	return 1
+}
+
+read_status_field()
+{
+	local status_path=3D"$1"
+	local field=3D"$2"
+	awk -F': ' -v key=3D"$field" '$1 =3D=3D key {print $2}' "$status_path" 2>=
/dev/null
+}
+
+# --- Parse arguments ----------------------------------------------------=
---
+
+while [[ $# -gt 0 ]]; do
+	case "$1" in
+	--mount-point)  MOUNT_POINT=3D"$2"; shift 2 ;;
+	--out-dir)      OUT_DIR=3D"$2"; shift 2 ;;
+	--client-id)    CLIENT_ID=3D"$2"; shift 2 ;;
+	--debugfs-root) DEBUGFS_ROOT=3D"$2"; shift 2 ;;
+	--help|-h)      usage; exit 0 ;;
+	*)              echo "Unknown option: $1" >&2; usage; exit 2 ;;
+	esac
+done
+
+if [[ -z "$MOUNT_POINT" ]]; then
+	echo "SKIP: --mount-point is required" >&2
+	usage
+	exit "$KSFT_SKIP"
+fi
+
+if [[ ! -d "$MOUNT_POINT" ]]; then
+	echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+	exit "$KSFT_SKIP"
+fi
+
+# Auto-detect client id when not specified, so all stages (including
+# stage 5 status check) use the same client consistently.
+if [[ -z "$CLIENT_ID" ]]; then
+	candidates=3D()
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		name=3D"$(basename "$entry")"
+		if [[ -r "${entry}reset/status" ]]; then
+			candidates+=3D("$name")
+		fi
+	done
+	if [[ ${#candidates[@]} -eq 1 ]]; then
+		CLIENT_ID=3D"${candidates[0]}"
+	elif [[ ${#candidates[@]} -gt 1 ]]; then
+		echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client=
-id." >&2
+		exit "$KSFT_SKIP"
+	fi
+fi
+
+if [[ -n "$CLIENT_ID" ]]; then
+	CLIENT_ARGS=3D(--client-id "$CLIENT_ID")
+fi
+DEBUGFS_ARGS=3D(--debugfs-root "$DEBUGFS_ROOT")
+
+# Quick sanity: can we write to the mount?
+if ! touch "$MOUNT_POINT/.validation_probe_$$" 2>/dev/null; then
+	echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+	exit "$KSFT_SKIP"
+fi
+rm -f "$MOUNT_POINT/.validation_probe_$$"
+
+mkdir -p "$OUT_DIR"
+
+echo ""
+echo "=3D=3D=3D CephFS Client Reset Validation =3D=3D=3D"
+echo ""
+
+# --- Stage 1: Baseline (no resets) --------------------------------------=
---
+
+stage1_out=3D"$OUT_DIR/stage1_baseline"
+if run_with_timeout "$STAGE1_TIMEOUT" "$stage1_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+	--mount-point "$MOUNT_POINT" \
+	--profile baseline \
+	--no-reset \
+	--duration-sec 60 \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--out-dir "$stage1_out"; then
+	stage_result 1 "baseline" "PASS" "60s, no resets"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 1 "baseline" "FAIL" "HUNG: killed after ${STAGE1_TIMEOUT}s"
+else
+	stage_result 1 "baseline" "FAIL" "see $stage1_out.log"
+fi
+
+# --- Stage 2: Corner cases ----------------------------------------------=
---
+
+stage2_out=3D"$OUT_DIR/stage2_corner_cases"
+mkdir -p "$stage2_out"
+if run_with_timeout "$STAGE2_TIMEOUT" "$stage2_out/output.log" \
+	"$SCRIPT_DIR/reset_corner_cases.sh" \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--mount-point "$MOUNT_POINT"; then
+	pass_line=3D$(grep -Eo '[0-9]+ passed, [0-9]+ failed, [0-9]+ skipped' "$s=
tage2_out/output.log" | tail -1)
+	stage_result 2 "corner_cases" "PASS" "${pass_line:-all tests passed}"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 2 "corner_cases" "FAIL" "HUNG: killed after ${STAGE2_TIMEOUT=
}s"
+else
+	fail_line=3D$(grep -c 'FAIL' "$stage2_out/output.log" 2>/dev/null); fail=
_line=3D"${fail_line:-?}"
+	stage_result 2 "corner_cases" "FAIL" "${fail_line} failures, see $stage2_=
out/output.log"
+fi
+
+# --- Stage 3: Moderate resets -------------------------------------------=
----
+
+stage3_out=3D"$OUT_DIR/stage3_moderate"
+if run_with_timeout "$STAGE3_TIMEOUT" "$stage3_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+	--mount-point "$MOUNT_POINT" \
+	--profile moderate \
+	--duration-sec 120 \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--out-dir "$stage3_out"; then
+	stage_result 3 "moderate" "PASS" "120s, resets every 5-15s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 3 "moderate" "FAIL" "HUNG: killed after ${STAGE3_TIMEOUT}s"
+else
+	stage_result 3 "moderate" "FAIL" "see $stage3_out.log"
+fi
+
+# --- Stage 4: Aggressive resets -----------------------------------------=
----
+
+stage4_out=3D"$OUT_DIR/stage4_aggressive"
+if run_with_timeout "$STAGE4_TIMEOUT" "$stage4_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+	--mount-point "$MOUNT_POINT" \
+	--profile aggressive \
+	--duration-sec 120 \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--out-dir "$stage4_out"; then
+	stage_result 4 "aggressive" "PASS" "120s, resets every 1-5s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 4 "aggressive" "FAIL" "HUNG: killed after ${STAGE4_TIMEOUT}s"
+else
+	stage_result 4 "aggressive" "FAIL" "see $stage4_out.log"
+fi
+
+# --- Stage 5: Post-run status check -------------------------------------=
---
+
+status_path=3D""
+if status_path=3D$(find_status_path); then
+	phase=3D$(read_status_field "$status_path" "phase")
+	last_errno=3D$(read_status_field "$status_path" "last_errno")
+	failure_count=3D$(read_status_field "$status_path" "failure_count")
+	drain_timed_out=3D$(read_status_field "$status_path" "drain_timed_out")
+	sessions_reset=3D$(read_status_field "$status_path" "sessions_reset")
+	blocked=3D$(read_status_field "$status_path" "blocked_requests")
+
+	# Save full status
+	cat "$status_path" > "$OUT_DIR/final_status.txt" 2>/dev/null
+
+	errors=3D""
+	[[ "$phase" !=3D "idle" ]] && errors=3D"${errors}phase=3D$phase "
+	[[ "$last_errno" !=3D "0" ]] && errors=3D"${errors}last_errno=3D$last_err=
no "
+	[[ "$failure_count" !=3D "0" && -n "$failure_count" ]] && errors=3D"${err=
ors}failure_count=3D$failure_count "
+	[[ "$blocked" !=3D "0" ]] && errors=3D"${errors}blocked_requests=3D$block=
ed "
+
+	if [[ -z "$errors" ]]; then
+		detail=3D"phase=3D$phase, last_errno=3D$last_errno, failure_count=3D${fa=
ilure_count:-0}"
+		[[ "$drain_timed_out" =3D=3D "yes" ]] && detail=3D"$detail, drain_timed_=
out=3Dyes"
+		[[ -n "$sessions_reset" ]] && detail=3D"$detail, sessions_reset=3D$sessi=
ons_reset"
+		stage_result 5 "status_check" "PASS" "$detail"
+	else
+		stage_result 5 "status_check" "FAIL" "$errors"
+	fi
+else
+	stage_result 5 "status_check" "FAIL" "cannot read reset/status"
+fi
+
+# --- Summary ------------------------------------------------------------=
----
+
+echo ""
+if [[ "$FAIL" -eq 0 ]]; then
+	echo "RESULT: $PASS/$TOTAL stages passed"
+else
+	echo "RESULT: $PASS/$TOTAL stages passed, $FAIL FAILED"
+fi
+echo "Artifacts: $OUT_DIR"
+echo ""
+
+exit "$FAIL"
diff --git a/tools/testing/selftests/filesystems/ceph/settings b/tools/test=
ing/selftests/filesystems/ceph/settings
new file mode 100644
index 000000000000..79b65bdf05db
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/settings
@@ -0,0 +1 @@
+timeout=3D1200
diff --git a/tools/testing/selftests/filesystems/ceph/validate_consistency.=
py b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
new file mode 100755
index 000000000000..c230a59bdb3a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import bisect
+import hashlib
+import json
+import os
+from pathlib import Path
+
+
+def sha256_file(path: Path) -> str:
+    digest =3D hashlib.sha256()
+    with path.open("rb") as handle:
+        while True:
+            chunk =3D handle.read(1 << 20)
+            if not chunk:
+                break
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def parse_io_log(path: Path):
+    records =3D []
+    if not path.exists():
+        return records
+    with path.open("r", encoding=3D"utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line =3D line.strip()
+            if not line:
+                continue
+            parts =3D line.split(",")
+            if len(parts) !=3D 5:
+                raise ValueError(f"io log line {line_no}: expected 5 colum=
ns, got {len(parts)}")
+            ts_ms, seq, logical_id, relpath, digest =3D parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "relpath": relpath,
+                    "digest": digest,
+                }
+            )
+    return records
+
+
+def parse_rename_log(path: Path):
+    records =3D []
+    if not path.exists():
+        return records
+    with path.open("r", encoding=3D"utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line =3D line.strip()
+            if not line:
+                continue
+            parts =3D line.split(",")
+            if len(parts) =3D=3D 6:
+                ts_ms, seq, logical_id, src_rel, dst_rel, rc =3D parts
+            elif len(parts) =3D=3D 7:
+                ts_ms, worker_id, seq, logical_id, src_rel, dst_rel, rc =
=3D parts
+                _ =3D worker_id  # worker id is informational only
+            else:
+                raise ValueError(
+                    f"rename log line {line_no}: expected 6 or 7 columns, =
got {len(parts)}"
+                )
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "src_rel": src_rel,
+                    "dst_rel": dst_rel,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_reset_log(path: Path):
+    records =3D []
+    if not path.exists():
+        return records
+    with path.open("r", encoding=3D"utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line =3D line.strip()
+            if not line:
+                continue
+            parts =3D line.split(",")
+            if len(parts) !=3D 4:
+                raise ValueError(f"reset log line {line_no}: expected 4 co=
lumns, got {len(parts)}")
+            ts_ms, seq, reason, rc =3D parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "reason": reason,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_status_file(path: Path):
+    status =3D {}
+    if not path.exists():
+        return status
+    with path.open("r", encoding=3D"utf-8") as handle:
+        for line in handle:
+            line =3D line.strip()
+            if not line or ":" not in line:
+                continue
+            key, value =3D line.split(":", 1)
+            status[key.strip()] =3D value.strip()
+    return status
+
+
+def to_int(value: str, default: int =3D 0):
+    try:
+        return int(value)
+    except (TypeError, ValueError):
+        return default
+
+
+def validate_namespace(data_dir: Path, file_count: int, issues):
+    actual_locations =3D {}
+    actual_paths =3D {}
+    for logical_id in range(file_count):
+        name =3D f"file_{logical_id:05d}"
+        found =3D []
+        for subdir in ("A", "B"):
+            candidate =3D data_dir / subdir / name
+            if candidate.exists():
+                found.append((subdir, candidate))
+        if len(found) !=3D 1:
+            issues.append(
+                f"namespace invariant failed for logical_id=3D{logical_id:=
05d}: expected exactly one file in A/B, found {len(found)}"
+            )
+            continue
+        actual_locations[logical_id] =3D found[0][0]
+        actual_paths[logical_id] =3D found[0][1]
+    return actual_locations, actual_paths
+
+
+def validate_rename_invariant(rename_records, actual_locations, issues):
+    expected_locations =3D {}
+    for rec in rename_records:
+        if rec["rc"] !=3D 0:
+            continue
+        dst =3D rec["dst_rel"]
+        if "/" not in dst:
+            continue
+        expected_locations[rec["logical_id"]] =3D dst.split("/", 1)[0]
+
+    for logical_id, expected in expected_locations.items():
+        actual =3D actual_locations.get(logical_id)
+        if actual is None:
+            continue
+        if actual !=3D expected:
+            issues.append(
+                f"rename invariant failed for logical_id=3D{logical_id:05d=
}: expected location=3D{expected}, actual=3D{actual}"
+            )
+
+
+def validate_data_invariant(io_records, actual_paths, issues):
+    expected_hash =3D {}
+    for rec in io_records:
+        digest =3D rec["digest"]
+        if not digest:
+            continue
+        expected_hash[rec["logical_id"]] =3D digest
+
+    for logical_id, digest in expected_hash.items():
+        path =3D actual_paths.get(logical_id)
+        if path is None:
+            continue
+        actual_digest =3D sha256_file(path)
+        if digest !=3D actual_digest:
+            issues.append(
+                f"data invariant failed for logical_id=3D{logical_id:05d}:=
 expected digest=3D{digest}, actual digest=3D{actual_digest}"
+            )
+
+
+def validate_reset_and_slo(args, reset_records, io_records, rename_records=
, status, issues):
+    if not args.expect_reset:
+        return
+
+    successful_reset_times =3D [rec["ts_ms"] for rec in reset_records if r=
ec["rc"] =3D=3D 0]
+    if not successful_reset_times:
+        issues.append("expected reset activity but no successful reset tri=
gger was observed")
+
+    phase =3D status.get("phase")
+    blocked_requests =3D to_int(status.get("blocked_requests", "0"), defau=
lt=3D-1)
+    last_errno =3D to_int(status.get("last_errno", "0"), default=3D1)
+    failure_count =3D to_int(status.get("failure_count", "0"), default=3D-=
1)
+
+    if phase is None:
+        issues.append("missing final reset status file or phase field")
+    elif phase.lower() !=3D "idle":
+        issues.append(f"recovery invariant failed: phase=3D{phase}, expect=
ed idle")
+
+    if blocked_requests !=3D 0:
+        issues.append(f"recovery invariant failed: blocked_requests=3D{blo=
cked_requests}, expected 0")
+    if last_errno !=3D 0:
+        issues.append(f"recovery invariant failed: last_errno=3D{last_errn=
o}, expected 0")
+    if failure_count > 0:
+        issues.append(
+            f"recovery invariant failed: failure_count=3D{failure_count}, "
+            "one or more resets failed during the run"
+        )
+
+    op_times =3D [rec["ts_ms"] for rec in io_records]
+    op_times.extend(rec["ts_ms"] for rec in rename_records if rec["rc"] =
=3D=3D 0)
+    op_times.sort()
+
+    if successful_reset_times and not op_times:
+        issues.append("recovery SLO failed: no workload completion events =
were recorded")
+        return
+
+    slo_ms =3D args.slo_seconds * 1000
+    for ts in successful_reset_times:
+        idx =3D bisect.bisect_left(op_times, ts)
+        if idx >=3D len(op_times):
+            issues.append(f"recovery SLO failed: no operation completion o=
bserved after reset at ts_ms=3D{ts}")
+            continue
+        delta =3D op_times[idx] - ts
+        if delta > slo_ms:
+            issues.append(
+                f"recovery SLO failed: first post-reset completion at {del=
ta}ms exceeds threshold {slo_ms}ms (reset ts_ms=3D{ts})"
+            )
+
+
+def main():
+    parser =3D argparse.ArgumentParser(description=3D"Validate Ceph reset =
stress artifacts")
+    parser.add_argument("--data-dir", required=3DTrue)
+    parser.add_argument("--file-count", required=3DTrue, type=3Dint)
+    parser.add_argument("--io-log", required=3DTrue)
+    parser.add_argument("--rename-log", required=3DTrue)
+    parser.add_argument("--reset-log", required=3DTrue)
+    parser.add_argument("--status-final", required=3DFalse, default=3D"")
+    parser.add_argument("--slo-seconds", required=3DFalse, type=3Dint, def=
ault=3D30)
+    parser.add_argument("--expect-reset", action=3D"store_true")
+    parser.add_argument("--report-json", required=3DFalse, default=3D"")
+    args =3D parser.parse_args()
+
+    data_dir =3D Path(args.data_dir)
+    io_log =3D Path(args.io_log)
+    rename_log =3D Path(args.rename_log)
+    reset_log =3D Path(args.reset_log)
+    status_final =3D Path(args.status_final) if args.status_final else Pat=
h("__missing_status__")
+
+    issues =3D []
+
+    if not data_dir.exists():
+        issues.append(f"data directory is missing: {data_dir}")
+
+    try:
+        io_records =3D parse_io_log(io_log)
+        rename_records =3D parse_rename_log(rename_log)
+        reset_records =3D parse_reset_log(reset_log)
+    except Exception as exc:
+        issues.append(f"log parsing failed: {exc}")
+        io_records =3D []
+        rename_records =3D []
+        reset_records =3D []
+
+    status =3D parse_status_file(status_final)
+
+    actual_locations, actual_paths =3D validate_namespace(data_dir, args.f=
ile_count, issues)
+    validate_rename_invariant(rename_records, actual_locations, issues)
+    validate_data_invariant(io_records, actual_paths, issues)
+    validate_reset_and_slo(args, reset_records, io_records, rename_records=
, status, issues)
+
+    report =3D {
+        "file_count": args.file_count,
+        "io_records": len(io_records),
+        "rename_records": len(rename_records),
+        "reset_records": len(reset_records),
+        "expect_reset": args.expect_reset,
+        "issues": issues,
+    }
+
+    if args.report_json:
+        report_path =3D Path(args.report_json)
+        report_path.write_text(json.dumps(report, indent=3D2, sort_keys=3D=
True), encoding=3D"utf-8")
+
+    if issues:
+        print("FAIL: consistency validation found issues")
+        for issue in issues:
+            print(f"  - {issue}")
+        raise SystemExit(1)
+
+    print("PASS: consistency validation succeeded")
+
+
+if __name__ =3D=3D "__main__":
+    main()
--=20
2.34.1
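
Note on the recovery SLO check in validate_reset_and_slo(): for each
successful reset, the validator looks up the first workload completion
at or after the reset timestamp and compares the gap with the
threshold. The standalone sketch below illustrates that lookup only.
It is not part of the patch; the function name slo_violations and the
sample timestamps are hypothetical and chosen for illustration.

```python
import bisect


def slo_violations(reset_times_ms, op_times_ms, slo_ms):
    """Return (reset_ts, delay_ms) pairs that miss the SLO.

    delay_ms is None when no completion follows the reset at all.
    """
    ops = sorted(op_times_ms)
    violations = []
    for ts in reset_times_ms:
        # Index of the first completion at or after the reset.
        idx = bisect.bisect_left(ops, ts)
        if idx >= len(ops):
            violations.append((ts, None))
            continue
        delay = ops[idx] - ts
        if delay > slo_ms:
            violations.append((ts, delay))
    return violations


if __name__ == "__main__":
    resets = [1000, 50000, 90000]
    completions = [500, 1200, 85000]
    # The second reset recovers too slowly; the third never recovers.
    print(slo_violations(resets, completions, 30000))
```

Because the completion list is sorted once, each reset costs a single
binary search, so the check stays cheap even for long aggressive runs
with many resets.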