From nobody Sat Jun 20 11:50:30 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v2 1/7] ceph: convert inode flags to named bit positions and atomic bitops
Date: Wed, 15 Apr 2026 17:00:37 +0000
Message-Id: <20260415170043.3882912-2-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Define named bit-position constants for all CEPH_I_* inode flags and
derive the bitmask values from them. This gives every flag a named _BIT
constant usable with the test_bit/set_bit/clear_bit family. The
intentionally unused bit position 1 is documented inline.

Convert all flag modifications to use atomic bitops (set_bit,
clear_bit, test_and_clear_bit). The previous code mixed lockless atomic
ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic
read-modify-write (|= / &= ~) on other flags sharing the same unsigned
long. A concurrent non-atomic RMW can clobber an adjacent lockless
atomic update -- for example, a lockless clear_bit(ERROR_WRITE) could be
silently resurrected by a concurrent ci->i_ceph_flags |= CEPH_I_FLUSH
under the spinlock. Using atomic bitops for all modifications
eliminates this class of race entirely.

Flags whose only users are now the _BIT form (ERROR_WRITE,
ERROR_FILELOCK, SHUTDOWN, ASYNC_CHECK_CAPS) have their old mask defines
removed to document that callers must use the _BIT constant with the
set_bit/test_bit family.

Flag reads under i_ceph_lock continue to use bitmask tests where the
tested flag is only modified under the same lock; this is safe because
the lock serialises both the read and the write. The remaining flags
continue to use non-atomic bitmask operations under i_ceph_lock, which
is correct and unchanged.

The lockless reader ceph_inode_is_shutdown() retains the READ_ONCE()
snapshot plus bitmask test pattern -- the single atomic load into a
local variable is correct and avoids a second memory access that
test_bit() would require.

The direct assignment in ceph_finish_async_create() is converted from
i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This inode is I_NEW at
this point -- still invisible to other threads and guaranteed to have
zero flags from alloc_inode -- so either form is safe, but set_bit()
keeps the conversion uniform. The only remaining direct assignment
(alloc_inode zeroing) operates on an inode that is not yet visible to
other threads, so it is safe without atomic ops.

The dead precomputed flags variable in ceph_pool_perm_check() is
removed; the check: loop re-reads flags from i_ceph_flags after the
set_bit() calls, keeping a single source of truth.

Co-developed-by: Viacheslav Dubeyko
Signed-off-by: Viacheslav Dubeyko
Signed-off-by: Alex Markuze
---
 fs/ceph/addr.c       | 16 +++++------
 fs/ceph/caps.c       | 24 ++++++++---------
 fs/ceph/file.c       | 12 ++++-----
 fs/ceph/inode.c      |  4 +--
 fs/ceph/locks.c      | 22 ++++-----------
 fs/ceph/mds_client.c |  3 ++-
 fs/ceph/mds_client.h |  2 +-
 fs/ceph/snap.c       |  2 +-
 fs/ceph/super.h      | 64 ++++++++++++++++++++++----------------------
 fs/ceph/xattr.c      |  2 +-
 10 files changed, 69 insertions(+), 82 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2090fc78529c..bde9efffa228 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -2583,20 +2583,18 @@ int ceph_pool_perm_check(struct inode *inode, int need)
 	if (ret < 0)
 		return ret;
 
-	flags = CEPH_I_POOL_PERM;
-	if (ret & POOL_READ)
-		flags |= CEPH_I_POOL_RD;
-	if (ret & POOL_WRITE)
-		flags |= CEPH_I_POOL_WR;
-
 	spin_lock(&ci->i_ceph_lock);
 	if (pool == ci->i_layout.pool_id &&
 	    pool_ns == rcu_dereference_raw(ci->i_layout.pool_ns)) {
-		ci->i_ceph_flags |= flags;
-        } else {
+		set_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_READ)
+			set_bit(CEPH_I_POOL_RD_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_WRITE)
+			set_bit(CEPH_I_POOL_WR_BIT, &ci->i_ceph_flags);
+	} else {
 		pool = ci->i_layout.pool_id;
-		flags = ci->i_ceph_flags;
 	}
+	flags = ci->i_ceph_flags;
 	spin_unlock(&ci->i_ceph_lock);
 	goto check;
 }
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..cb9e78b713d9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -549,7 +549,7 @@ static void __cap_delay_requeue_front(struct ceph_mds_client *mdsc,
 
 	doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
 	spin_lock(&mdsc->cap_delay_lock);
-	ci->i_ceph_flags |= CEPH_I_FLUSH;
+	set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 	if (!list_empty(&ci->i_cap_delay_list))
 		list_del_init(&ci->i_cap_delay_list);
 	list_add(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
@@ -1409,7 +1409,7 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
 	      ceph_cap_string(revoking));
 	BUG_ON((retain & CEPH_CAP_PIN) == 0);
 
-	ci->i_ceph_flags &= ~CEPH_I_FLUSH;
+	clear_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 
 	cap->issued &= retain;  /* drop bits we don't want */
 	/*
@@ -1666,7 +1666,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
 		last_tid = capsnap->cap_flush.tid;
 	}
 
-	ci->i_ceph_flags &= ~CEPH_I_FLUSH_SNAPS;
+	clear_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 
 	while (first_tid <= last_tid) {
 		struct ceph_cap *cap = ci->i_auth_cap;
@@ -2026,7 +2026,7 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags)
 
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		ci->i_ceph_flags |= CEPH_I_ASYNC_CHECK_CAPS;
+		set_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT, &ci->i_ceph_flags);
 
 		/* Don't send messages until we get async create reply */
 		spin_unlock(&ci->i_ceph_lock);
@@ -2577,7 +2577,7 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
 		return;
 
-	ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH;
+	clear_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
 
 	list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) {
 		if (cf->is_capsnap) {
@@ -2686,7 +2686,7 @@ void ceph_early_kick_flushing_caps(struct ceph_mds_client *mdsc,
 			__kick_flushing_caps(mdsc, session, ci,
 					     oldest_flush_tid);
 		} else {
-			ci->i_ceph_flags |= CEPH_I_KICK_FLUSH;
+			set_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
 		}
 
 		spin_unlock(&ci->i_ceph_lock);
@@ -2829,7 +2829,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 	spin_lock(&ci->i_ceph_lock);
 
 	if ((flags & CHECK_FILELOCK) &&
-	    (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK)) {
+	    test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		doutc(cl, "%p %llx.%llx error filelock\n", inode,
 		      ceph_vinop(inode));
 		ret = -EIO;
@@ -3207,7 +3207,7 @@ static int ceph_try_drop_cap_snap(struct ceph_inode_info *ci,
 		BUG_ON(capsnap->cap_flush.tid > 0);
 		ceph_put_snap_context(capsnap->context);
 		if (!list_is_last(&capsnap->ci_item, &ci->i_cap_snaps))
-			ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+			set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 
 		list_del(&capsnap->ci_item);
 		ceph_put_cap_snap(capsnap);
@@ -3396,7 +3396,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
 				if (ceph_try_drop_cap_snap(ci, capsnap)) {
 					put++;
 				} else {
-					ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+					set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 					flush_snaps = true;
 				}
 			}
@@ -3648,7 +3648,7 @@ static void handle_cap_grant(struct inode *inode,
 
 		if (ci->i_layout.pool_id != old_pool ||
 		    extra_info->pool_ns != old_ns)
-			ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
 
 		extra_info->pool_ns = old_ns;
 
@@ -4815,7 +4815,7 @@ int ceph_drop_caps_for_unlink(struct inode *inode)
 			doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode,
 			      ceph_vinop(inode));
 			spin_lock(&mdsc->cap_delay_lock);
-			ci->i_ceph_flags |= CEPH_I_FLUSH;
+			set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 			if (!list_empty(&ci->i_cap_delay_list))
 				list_del_init(&ci->i_cap_delay_list);
 			list_add_tail(&ci->i_cap_delay_list,
@@ -5080,7 +5080,7 @@ int ceph_purge_inode_cap(struct inode *inode, struct ceph_cap *cap, bool *invali
 
 		if (atomic_read(&ci->i_filelock_ref) > 0) {
 			/* make further file lock syscall return -EIO */
-			ci->i_ceph_flags |= CEPH_I_ERROR_FILELOCK;
+			set_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 			pr_warn_ratelimited_client(cl,
 				" dropping file locks for %p %llx.%llx\n",
 				inode, ceph_vinop(inode));
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5e7c73a29aa3..2b457dab0837 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -579,12 +579,11 @@ static void wake_async_create_waiters(struct inode *inode,
 
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		clear_and_wake_up_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
+		clear_and_wake_up_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
 
-		if (ci->i_ceph_flags & CEPH_I_ASYNC_CHECK_CAPS) {
-			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CHECK_CAPS;
+		if (test_and_clear_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT,
+				       &ci->i_ceph_flags))
 			check_cap = true;
-		}
 	}
 	ceph_kick_flushing_inode_caps(session, ci);
 	spin_unlock(&ci->i_ceph_lock);
@@ -747,7 +746,8 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
 		 * that point and don't worry about setting
 		 * CEPH_I_ASYNC_CREATE.
 		 */
-		ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+		set_bit(CEPH_I_ASYNC_CREATE_BIT,
+			&ceph_inode(inode)->i_ceph_flags);
 		unlock_new_inode(inode);
 	}
 	if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
@@ -2422,7 +2422,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
 	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
-	    (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+	    test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
 		struct ceph_snap_context *snapc;
 		struct iov_iter data;
 
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index d99e12d1100b..f75d66760d54 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1142,7 +1142,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		rcu_assign_pointer(ci->i_layout.pool_ns, pool_ns);
 
 		if (ci->i_layout.pool_id != old_pool || pool_ns != old_ns)
-			ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
 
 		pool_ns = old_ns;
 
@@ -3199,7 +3199,7 @@ void ceph_inode_shutdown(struct inode *inode)
 	bool invalidate = false;
 
 	spin_lock(&ci->i_ceph_lock);
-	ci->i_ceph_flags |= CEPH_I_SHUTDOWN;
+	set_bit(CEPH_I_SHUTDOWN_BIT, &ci->i_ceph_flags);
 	p = rb_first(&ci->i_caps);
 	while (p) {
 		struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index dd764f9c64b9..c4ff2266bb94 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -57,9 +57,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
 	ci = ceph_inode(inode);
 	if (atomic_dec_and_test(&ci->i_filelock_ref)) {
 		/* clear error when all locks are released */
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
-		spin_unlock(&ci->i_ceph_lock);
+		clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 	}
 	fl->fl_u.ceph.inode = NULL;
 	iput(inode);
@@ -271,15 +269,10 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
 	else if (IS_SETLKW(cmd))
 		wait = 1;
 
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err = -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
 			posix_lock_file(file, fl, NULL);
-		return err;
+		return -EIO;
 	}
 
 	if (lock_is_read(fl))
@@ -331,15 +324,10 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
 
 	doutc(cl, "fl_file: %p\n", fl->c.flc_file);
 
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err = -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (lock_is_unlock(fl))
 			locks_lock_file_wait(file, fl);
-		return err;
+		return -EIO;
 	}
 
 	if (IS_SETLKW(cmd))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..ccf0d53dde2b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3613,7 +3613,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
 
 			spin_lock(&ci->i_ceph_lock);
 			cap = ci->i_auth_cap;
-			if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
+			if (test_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
+			    mds != cap->mds) {
 				doutc(cl, "session changed for auth cap %d -> %d\n",
 				      cap->session->s_mds, session->s_mds);
 
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0428a5eaf28c..e91a199d56fd 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -658,7 +658,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 
-	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_I_ASYNC_CREATE_BIT,
 			   TASK_KILLABLE);
 }
 
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 52b4c2684f92..9b79a5eaca93 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -700,7 +700,7 @@ int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
 		return 0;
 	}
 
-	ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+	set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 	doutc(cl, "%p %llx.%llx cap_snap %p snapc %p %llu %s s=%llu\n",
 	      inode, ceph_vinop(inode), capsnap, capsnap->context,
 	      capsnap->context->seq, ceph_cap_string(capsnap->dirty),
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 29a980e22dc2..c89ad8dcc969 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -655,23 +655,32 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 /*
  * Ceph inode.
  */
-#define CEPH_I_DIR_ORDERED	(1 << 0)  /* dentries in dir are ordered */
-#define CEPH_I_FLUSH		(1 << 2)  /* do not delay flush of dirty metadata */
-#define CEPH_I_POOL_PERM	(1 << 3)  /* pool rd/wr bits are valid */
-#define CEPH_I_POOL_RD		(1 << 4)  /* can read from pool */
-#define CEPH_I_POOL_WR		(1 << 5)  /* can write to pool */
-#define CEPH_I_SEC_INITED	(1 << 6)  /* security initialized */
-#define CEPH_I_KICK_FLUSH	(1 << 7)  /* kick flushing caps */
-#define CEPH_I_FLUSH_SNAPS	(1 << 8)  /* need flush snapss */
-#define CEPH_I_ERROR_WRITE	(1 << 9) /* have seen write errors */
-#define CEPH_I_ERROR_FILELOCK	(1 << 10) /* have seen file lock errors */
-#define CEPH_I_ODIRECT_BIT	(11) /* inode in direct I/O mode */
-#define CEPH_I_ODIRECT		(1 << CEPH_I_ODIRECT_BIT)
-#define CEPH_ASYNC_CREATE_BIT	(12)	  /* async create in flight for this */
-#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
-#define CEPH_I_SHUTDOWN		(1 << 13) /* inode is no longer usable */
-#define CEPH_I_ASYNC_CHECK_CAPS	(1 << 14) /* check caps immediately after async
-					     creating finishes */
+#define CEPH_I_DIR_ORDERED_BIT		(0)  /* dentries in dir are ordered */
+					     /* bit 1 historically unused */
+#define CEPH_I_FLUSH_BIT		(2)  /* do not delay flush of dirty metadata */
+#define CEPH_I_POOL_PERM_BIT		(3)  /* pool rd/wr bits are valid */
+#define CEPH_I_POOL_RD_BIT		(4)  /* can read from pool */
+#define CEPH_I_POOL_WR_BIT		(5)  /* can write to pool */
+#define CEPH_I_SEC_INITED_BIT		(6)  /* security initialized */
+#define CEPH_I_KICK_FLUSH_BIT		(7)  /* kick flushing caps */
+#define CEPH_I_FLUSH_SNAPS_BIT		(8)  /* need flush snaps */
+#define CEPH_I_ERROR_WRITE_BIT		(9)  /* have seen write errors */
+#define CEPH_I_ERROR_FILELOCK_BIT	(10) /* have seen file lock errors */
+#define CEPH_I_ODIRECT_BIT		(11) /* inode in direct I/O mode */
+#define CEPH_I_ASYNC_CREATE_BIT		(12) /* async create in flight for this */
+#define CEPH_I_SHUTDOWN_BIT		(13) /* inode is no longer usable */
+#define CEPH_I_ASYNC_CHECK_CAPS_BIT	(14) /* check caps after async creating finishes */
+
+#define CEPH_I_DIR_ORDERED		(1 << CEPH_I_DIR_ORDERED_BIT)
+#define CEPH_I_FLUSH			(1 << CEPH_I_FLUSH_BIT)
+#define CEPH_I_POOL_PERM		(1 << CEPH_I_POOL_PERM_BIT)
+#define CEPH_I_POOL_RD			(1 << CEPH_I_POOL_RD_BIT)
+#define CEPH_I_POOL_WR			(1 << CEPH_I_POOL_WR_BIT)
+#define CEPH_I_SEC_INITED		(1 << CEPH_I_SEC_INITED_BIT)
+#define CEPH_I_KICK_FLUSH		(1 << CEPH_I_KICK_FLUSH_BIT)
+#define CEPH_I_FLUSH_SNAPS		(1 << CEPH_I_FLUSH_SNAPS_BIT)
+#define CEPH_I_ODIRECT			(1 << CEPH_I_ODIRECT_BIT)
+#define CEPH_I_ASYNC_CREATE		(1 << CEPH_I_ASYNC_CREATE_BIT)
 
 /*
  * Masks of ceph inode work.
@@ -684,27 +693,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 
 /*
  * We set the ERROR_WRITE bit when we start seeing write errors on an inode
- * and then clear it when they start succeeding. Note that we do a lockless
- * check first, and only take the lock if it looks like it needs to be changed.
- * The write submission code just takes this as a hint, so we're not too
- * worried if a few slip through in either direction.
+ * and then clear it when they start succeeding. The write submission code
+ * just takes this as a hint, so we're not too worried if a few slip through
+ * in either direction.
  */
 static inline void ceph_set_error_write(struct ceph_inode_info *ci)
 {
-	if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
 
 static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
 {
-	if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
 
 static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
@@ -1142,7 +1142,7 @@ static inline bool ceph_inode_is_shutdown(struct inode *inode)
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
 	int state = READ_ONCE(fsc->mount_state);
 
-	return (flags & CEPH_I_SHUTDOWN) || state >= CEPH_MOUNT_SHUTDOWN;
+	return (flags & BIT(CEPH_I_SHUTDOWN_BIT)) || state >= CEPH_MOUNT_SHUTDOWN;
 }
 
 /* xattr.c */
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 5f87f62091a1..7cf9e908c2fe 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1054,7 +1054,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value,
 	if (current->journal_info &&
 	    !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) &&
 	    security_ismaclabel(name + XATTR_SECURITY_PREFIX_LEN))
-		ci->i_ceph_flags |= CEPH_I_SEC_INITED;
+		set_bit(CEPH_I_SEC_INITED_BIT, &ci->i_ceph_flags);
 out:
 	spin_unlock(&ci->i_ceph_lock);
 	return err;
-- 
2.34.1

From nobody Sat Jun 20 11:50:30 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v2 2/7] ceph: use proper endian conversion for flock_len in reconnect
Date: Wed, 15 Apr 2026 17:00:38 +0000
Message-Id: <20260415170043.3882912-3-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Replace the __force __le32 cast with cpu_to_le32() for the flock_len
field in reconnect_caps_cb().
The old code used a type-system bypass to silence sparse; the new form
uses the proper endian conversion macro.

Also switch from a raw bitmask test against i_ceph_flags to test_bit()
on the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor
for the unsigned long flags field after the bit-position conversion.

Signed-off-by: Alex Markuze
Reviewed-by: Viacheslav Dubeyko
---
 fs/ceph/mds_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ccf0d53dde2b..871f0eef468d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4693,8 +4693,9 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
 		rec.v2.issued = cpu_to_le32(cap->issued);
 		rec.v2.snaprealm = cpu_to_le64(ci->i_snap_realm->ino);
 		rec.v2.pathbase = cpu_to_le64(path_info.vino.ino);
-		rec.v2.flock_len = (__force __le32)
-			((ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) ? 0 : 1);
+		rec.v2.flock_len = cpu_to_le32(
+			test_bit(CEPH_I_ERROR_FILELOCK_BIT,
+				 &ci->i_ceph_flags) ? 0 : 1);
 	} else {
 		struct timespec64 ts;
 
-- 
2.34.1

From nobody Sat Jun 20 11:50:30 2026
Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com.
[13.121.85.79]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Apr 2026 10:01:06 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v2 3/7] ceph: harden send_mds_reconnect and handle active-MDS peer reset Date: Wed, 15 Apr 2026 17:00:39 +0000 Message-Id: <20260415170043.3882912-4-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com> References: <20260415170043.3882912-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Change send_mds_reconnect() to return an error code so callers can detect and report reconnect failures instead of silently ignoring them. Add early bailout checks for sessions that are already closed, rejected, or unregistered, which avoids sending reconnect messages for sessions that can no longer be recovered. The early -ESTALE and -ENOENT bailouts use a separate fail_return label that skips the pr_err_client diagnostic, since these codes indicate expected concurrent-teardown races rather than genuine reconnect build failures. Save the prior session state before transitioning to RECONNECTING, and restore it in the failure path. Without this, a transient build or encoding failure (-ENOMEM, -ENOSPC) strands the session in RECONNECTING indefinitely because check_new_map() only retries sessions in RESTARTING state. Rewrite mds_peer_reset() to handle the case where the MDS is past its RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT messages because it only accepts them during its own RECONNECT window after restart. 
Previously, the client would send a doomed reconnect that the MDS would reject or ignore. Now, the client tears the session down locally and lets new requests re-open a fresh session, which is the correct recovery for this scenario. The RECONNECTING state is handled on the same teardown path, since the MDS will reject reconnect attempts from an active client regardless of the session's local state.

The session teardown path in mds_peer_reset() follows the established drop-and-reacquire locking pattern from check_new_map(): take mdsc->mutex for session unregistration, release it, then take s->s_mutex separately for cleanup. This avoids introducing a new simultaneous lock nesting pattern.

Log reconnect failures from check_new_map() and mds_peer_reset() at pr_warn level rather than pr_err, since return codes like -ESTALE (closed/rejected session) and -ENOENT (unregistered session) are expected during concurrent teardown. Log dropped messages for unregistered sessions via doutc() (dynamic debug) rather than pr_info, as post-reset message arrival is routine and does not warrant unconditional logging.
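
The save-and-restore behaviour described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the names model_state, model_session and model_send_reconnect, and the build_ok parameter, are hypothetical and do not exist in fs/ceph. The sketch keeps just the control flow: quiet early bailouts with -ESTALE and -ENOENT, and a rollback of the session state when building the reconnect message fails.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel's types. */
enum model_state {
	MODEL_OPEN,
	MODEL_RESTARTING,
	MODEL_RECONNECTING,
	MODEL_CLOSED,
	MODEL_REJECTED,
};

struct model_session {
	enum model_state state;
	bool registered;	/* stands in for mdsc->sessions[mds] == session */
};

/*
 * Model of the reconnect entry point. build_ok is an assumption used to
 * simulate whether the reconnect message could be built; false models a
 * transient -ENOMEM failure.
 */
static int model_send_reconnect(struct model_session *s, bool build_ok)
{
	enum model_state old_state;

	/* Expected teardown races: bail out quietly, state untouched. */
	if (s->state == MODEL_CLOSED || s->state == MODEL_REJECTED)
		return -ESTALE;
	if (!s->registered)
		return -ENOENT;

	/* Remember the prior state before committing to RECONNECTING. */
	old_state = s->state;
	s->state = MODEL_RECONNECTING;

	if (!build_ok) {
		/* Roll back so a later map update can retry the session. */
		s->state = old_state;
		return -ENOMEM;
	}
	return 0;
}
```

In this model, a session that starts in MODEL_RESTARTING and hits a build failure ends up back in MODEL_RESTARTING, which is the property the patch relies on so that a map-driven retry can pick the session up again.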
Signed-off-by: Alex Markuze --- fs/ceph/mds_client.c | 163 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 151 insertions(+), 12 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 871f0eef468d..b14ede808436 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *= session, break; =20 case CEPH_SESSION_REJECT: - WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING); - pr_info_client(cl, "mds%d rejected session\n", - session->s_mds); + WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING && + session->s_state !=3D CEPH_MDS_SESSION_RECONNECTING); + if (session->s_state =3D=3D CEPH_MDS_SESSION_RECONNECTING) + pr_info_client(cl, "mds%d reconnect rejected\n", + session->s_mds); + else + pr_info_client(cl, "mds%d rejected session\n", + session->s_mds); session->s_state =3D CEPH_MDS_SESSION_REJECTED; cleanup_session_requests(mdsc, session); remove_session_caps(session); @@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, in= t mds, void *arg) cap->mseq =3D 0; /* and migrate_seq */ cap->cap_gen =3D atomic_read(&cap->session->s_cap_gen); =20 + /* + * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect. + * Instead, locks are submitted for best-effort MDS reclaim + * via the flock_len field below. If reclaim fails (e.g., + * another client grabbed a conflicting lock), future lock + * operations will fail and set the error flag at that point. + */ + /* These are lost when the session goes away */ if (S_ISDIR(inode->i_mode)) { if (cap->issued & CEPH_CAP_DIR_CREATE) { @@ -4892,13 +4905,14 @@ static int encode_snap_realms(struct ceph_mds_clien= t *mdsc, * * This is a relatively heavyweight operation, but it's rare. 
*/ -static void send_mds_reconnect(struct ceph_mds_client *mdsc, - struct ceph_mds_session *session) +static int send_mds_reconnect(struct ceph_mds_client *mdsc, + struct ceph_mds_session *session) { struct ceph_client *cl =3D mdsc->fsc->client; struct ceph_msg *reply; int mds =3D session->s_mds; int err =3D -ENOMEM; + int old_state; struct ceph_reconnect_state recon_state =3D { .session =3D session, }; @@ -4917,6 +4931,31 @@ static void send_mds_reconnect(struct ceph_mds_clien= t *mdsc, xa_destroy(&session->s_delegated_inos); =20 mutex_lock(&session->s_mutex); + if (session->s_state =3D=3D CEPH_MDS_SESSION_CLOSED || + session->s_state =3D=3D CEPH_MDS_SESSION_REJECTED) { + pr_info_client(cl, "mds%d skipping reconnect, session %s\n", + mds, + ceph_session_state_name(session->s_state)); + mutex_unlock(&session->s_mutex); + ceph_msg_put(reply); + err =3D -ESTALE; + goto fail_return; + } + + mutex_lock(&mdsc->mutex); + if (mds >=3D mdsc->max_sessions || mdsc->sessions[mds] !=3D session) { + mutex_unlock(&mdsc->mutex); + pr_info_client(cl, + "mds%d skipping reconnect, session unregistered\n", + mds); + mutex_unlock(&session->s_mutex); + ceph_msg_put(reply); + err =3D -ENOENT; + goto fail_return; + } + mutex_unlock(&mdsc->mutex); + + old_state =3D session->s_state; session->s_state =3D CEPH_MDS_SESSION_RECONNECTING; session->s_seq =3D 0; =20 @@ -5046,18 +5085,34 @@ static void send_mds_reconnect(struct ceph_mds_clie= nt *mdsc, =20 up_read(&mdsc->snap_rwsem); ceph_pagelist_release(recon_state.pagelist); - return; + return 0; =20 fail: ceph_msg_put(reply); up_read(&mdsc->snap_rwsem); + /* + * Restore prior session state so map-driven reconnect logic + * (check_new_map) can retry. Without this, a transient build + * failure strands the session in RECONNECTING indefinitely. 
+ */ + session->s_state =3D old_state; mutex_unlock(&session->s_mutex); fail_nomsg: ceph_pagelist_release(recon_state.pagelist); fail_nopagelist: pr_err_client(cl, "error %d preparing reconnect for mds%d\n", err, mds); - return; + return err; + +fail_return: + /* + * Early-exit path for expected concurrent-teardown races + * (-ESTALE for closed/rejected sessions, -ENOENT for + * unregistered sessions). Skip the pr_err_client diagnostic + * since these are not genuine reconnect build failures. + */ + ceph_pagelist_release(recon_state.pagelist); + return err; } =20 =20 @@ -5138,9 +5193,15 @@ static void check_new_map(struct ceph_mds_client *md= sc, */ if (s->s_state =3D=3D CEPH_MDS_SESSION_RESTARTING && newstate >=3D CEPH_MDS_STATE_RECONNECT) { + int rc; + mutex_unlock(&mdsc->mutex); clear_bit(i, targets); - send_mds_reconnect(mdsc, s); + rc =3D send_mds_reconnect(mdsc, s); + if (rc) + pr_warn_client(cl, + "mds%d reconnect failed: %d\n", + i, rc); mutex_lock(&mdsc->mutex); } =20 @@ -5204,7 +5265,11 @@ static void check_new_map(struct ceph_mds_client *md= sc, } doutc(cl, "send reconnect to export target mds.%d\n", i); mutex_unlock(&mdsc->mutex); - send_mds_reconnect(mdsc, s); + err =3D send_mds_reconnect(mdsc, s); + if (err) + pr_warn_client(cl, + "mds%d export target reconnect failed: %d\n", + i, err); ceph_put_mds_session(s); mutex_lock(&mdsc->mutex); } @@ -6284,12 +6349,84 @@ static void mds_peer_reset(struct ceph_connection *= con) { struct ceph_mds_session *s =3D con->private; struct ceph_mds_client *mdsc =3D s->s_mdsc; + int session_state; =20 pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n", s->s_mds); - if (READ_ONCE(mdsc->fsc->mount_state) !=3D CEPH_MOUNT_FENCE_IO && - ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >=3D CEPH_MDS_STATE_REC= ONNECT) - send_mds_reconnect(mdsc, s); + + if (READ_ONCE(mdsc->fsc->mount_state) =3D=3D CEPH_MOUNT_FENCE_IO || + ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONN= ECT) + return; + + if 
(ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) =3D=3D CEPH_MDS_STATE_R= ECONNECT) { + int rc =3D send_mds_reconnect(mdsc, s); + + if (rc) + pr_warn_client(mdsc->fsc->client, + "mds%d reconnect failed: %d\n", + s->s_mds, rc); + return; + } + + /* + * MDS is active (past RECONNECT). It will not accept a + * CLIENT_RECONNECT from us, so tear the session down locally + * and let new requests re-open a fresh session. + * + * Snapshot session state with READ_ONCE, then revalidate under + * mdsc->mutex before acting. The subsequent mdsc->mutex + * section rechecks s_state to catch concurrent transitions, so + * the lockless snapshot here is safe. s->s_mutex is taken + * separately for cleanup after unregistration, which avoids + * introducing a new s->s_mutex + mdsc->mutex nesting. + */ + session_state =3D READ_ONCE(s->s_state); + + switch (session_state) { + case CEPH_MDS_SESSION_RESTARTING: + case CEPH_MDS_SESSION_RECONNECTING: + case CEPH_MDS_SESSION_CLOSING: + case CEPH_MDS_SESSION_OPEN: + case CEPH_MDS_SESSION_HUNG: + case CEPH_MDS_SESSION_OPENING: + mutex_lock(&mdsc->mutex); + if (s->s_mds >=3D mdsc->max_sessions || + mdsc->sessions[s->s_mds] !=3D s || + s->s_state !=3D session_state) { + pr_info_client(mdsc->fsc->client, + "mds%d state changed to %s during peer reset\n", + s->s_mds, + ceph_session_state_name(s->s_state)); + mutex_unlock(&mdsc->mutex); + return; + } + + ceph_get_mds_session(s); + s->s_state =3D CEPH_MDS_SESSION_CLOSED; + __unregister_session(mdsc, s); + __wake_requests(mdsc, &s->s_waiting); + mutex_unlock(&mdsc->mutex); + + mutex_lock(&s->s_mutex); + cleanup_session_requests(mdsc, s); + remove_session_caps(s); + mutex_unlock(&s->s_mutex); + + wake_up_all(&mdsc->session_close_wq); + + mutex_lock(&mdsc->mutex); + kick_requests(mdsc, s->s_mds); + mutex_unlock(&mdsc->mutex); + + ceph_put_mds_session(s); + break; + default: + pr_warn_client(mdsc->fsc->client, + "mds%d peer reset in unexpected state %s\n", + s->s_mds, + 
ceph_session_state_name(session_state)); + break; + } } =20 static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg) @@ -6301,6 +6438,8 @@ static void mds_dispatch(struct ceph_connection *con,= struct ceph_msg *msg) =20 mutex_lock(&mdsc->mutex); if (__verify_registered_session(mdsc, s) < 0) { + doutc(cl, "dropping tid %llu from unregistered session %d\n", + le64_to_cpu(msg->hdr.tid), s->s_mds); mutex_unlock(&mdsc->mutex); goto out; } --=20 2.34.1 From nobody Sat Jun 20 11:50:30 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 708FB3E51ED for ; Wed, 15 Apr 2026 17:01:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776272477; cv=none; b=nnkv4uCHMSmqjWUGszQ6BqrabWOvoso87Se0BzxE20qolp0Xos0p7d0R36innczLviYUhQLQbbbUiek0BVvHiaTtHE0FA5jpIscQQVIbHFR9z/g2pHv8LSCpWUxl2yGKPv0AHwHUvbMZr/eMcqJp+u1w+CJ0S353qgO1xen3ZhQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776272477; c=relaxed/simple; bh=+CTSEsp/2olU8L6a/rv052YQpe4i7E93BUYU+16kL6E=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=LEL0/qrMjj5aMOpxKjOSQN2jtNSuYFmNbt51uEZDHcTAbRGfiu2rcqwJe6q8zCpjZidOimDsZnTYYvx/FN/pJbjXhxiwO/Z/c/RvDZZzL5scvH6Q/fvFAXsGLJAAUDu+CydJl8XduxMrIKb7iYZFVyiTo0owMlUhy8u/a3nyoUs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=TcKPk9V4; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b=Mq5tueLE; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: 
smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="TcKPk9V4"; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b="Mq5tueLE" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1776272474; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=jp1zrO1uuVX04cOgiBvHowBM6uu8e3oJ6YY/cYsJ/g0=; b=TcKPk9V47vJ/cqMfsTxqGiduzX3F8XfIe2gcTmXJpTKzBgfd62s0Dmq0WAZ5A2mgjQECol UmvGmw5vXHfnvQCItpUQk65AW/S5nwIxqWYKafQSrSwq1CeLzfgB8S+Z4mBCAccPW8cBn3 6SkFILo9LjrUMLHbgRNugLDjdhD96i4= Received: from mail-qt1-f200.google.com (mail-qt1-f200.google.com [209.85.160.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-283-jsF3jEvxOByCU_cPqj1ecQ-1; Wed, 15 Apr 2026 13:01:13 -0400 X-MC-Unique: jsF3jEvxOByCU_cPqj1ecQ-1 X-Mimecast-MFC-AGG-ID: jsF3jEvxOByCU_cPqj1ecQ_1776272472 Received: by mail-qt1-f200.google.com with SMTP id d75a77b69052e-50d76f460b2so182208511cf.2 for ; Wed, 15 Apr 2026 10:01:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=google; t=1776272472; x=1776877272; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=jp1zrO1uuVX04cOgiBvHowBM6uu8e3oJ6YY/cYsJ/g0=; b=Mq5tueLEdfHrlPkFgPdOfYmUFNxCg0HAfslBQ7KLWFA0lsKVej7X3wCv3suHT6/1+w +uAWH214QOjr1sFdLrqOUClsc1ncaBRy4DLJgN64US1wj9v8IlapTXdnw9oVhr+dwhz5 NXCbJ/feEZwhvRZpqXKu9PTGVoRj25tojae9FswuwLTqbqZh3dFGoa4qcisHc0p+O7t7 
6NbOG9NvAr8HLIjFiq0mfxO8do8osiuSR8XNttU2JaPI96oNl6whz8pj3YoQKleg73Ch iGwTwNJv5c+iPpNIaxfFzEqoTjBJKLf5NNZXe7CMiz/7CEPmQPDcnkrYZu4I68ZpgrZy P45Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1776272472; x=1776877272; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=jp1zrO1uuVX04cOgiBvHowBM6uu8e3oJ6YY/cYsJ/g0=; b=dVYXUGMnkfUXWtmmNqCPdoefmAQ85R41IFf/6eBq7AWPdthZGNkUKSnM1l7+NhvcpJ rViUx7IR6RKvshhDQcOER/PWZJtnTKhnQdVP6mgia3lk9qjPoc+N7R0x6Nuxhm/4nvWr 5frTSFpjcf7l3Hd2msl8n0pYYvOMoJkMKjLkLbapjSe7v/Yy45IJN6fEx8ammBYlb2JO D/ZEcz4lbM4QaWQg4hlcI8eNfpNHhnU4oGem6wnCfYaOY8z4LmoN20VuW2EnBJG89FSv zdkTqR4pWWtKJOe3U4rDhEL+ipC3W+VaVk+ZzH2i977lMvlf/FmdJtGVI3+78xfBkl7v nNUQ== X-Gm-Message-State: AOJu0YzVh8S332V9sZ2J2bjVIP5hFKlMbHCNcd0rG/1lRQQLk1k1O3jG 6VSmiW8d+irBVyuoVIdDVLSt908AvjOjhwPEXbZomXbaU7DJDHldOcZqOhclTYb8cJBXJX5fdua mIcNxSHn0d6R3wknka+Gr2L2p1n9LZfDYqKFEPoNq7lv5lFVZvMdRFCHE9DaNrlScPQ== X-Gm-Gg: AeBDieuJ2xZqyzSPtddF+wksE4DchoxfteDDSrQotOB7VvjHRqeAF7c0q0liQabmyiy tW5CUrDkYyR5iP/tJWU3FDw4NFKUrDd8WFqZkaySvKx2pOn872JVEMyd4/My5m+5mz7N1nl6Hpz SabPQ9Efe8m00qtPJCJQhdOLwC3A7FMf8zAg5Y6TBeX0AdXOsgrS5LRLHIgymwbC2vNQxS5btzs uqmBO+2Y9qmMMR0MwYyA2oLdycDm7twNFmeTRjEsB0zC5J7ZHQD+JUZ+MjoaDvCWlQnAG/ZQ2JZ 7jtEWlMfyF+lO+ERYDLyVdYMtpsbGo45o/huFIX7+WdBOocHgtosy6QXI1Nd43O6yJaGZN0HVJ/ sokdv3blQXaBa5GjKpT0DlIz5zEepV8jQeJ6Wr+8oBBW5HOv1xxoA9E4PeVUspOX6zw== X-Received: by 2002:a05:622a:1b1e:b0:50d:38d5:c6b1 with SMTP id d75a77b69052e-50dd5b526bamr327474031cf.16.1776272472265; Wed, 15 Apr 2026 10:01:12 -0700 (PDT) X-Received: by 2002:a05:622a:1b1e:b0:50d:38d5:c6b1 with SMTP id d75a77b69052e-50dd5b526bamr327470971cf.16.1776272470063; Wed, 15 Apr 2026 10:01:10 -0700 (PDT) Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com. 
[13.121.85.79]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Apr 2026 10:01:09 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v2 4/7] ceph: add diagnostic timeout loop to wait_caps_flush() Date: Wed, 15 Apr 2026 17:00:40 +0000 Message-Id: <20260415170043.3882912-5-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com> References: <20260415170043.3882912-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Convert wait_caps_flush() from a silent indefinite wait into a diagnostic wait loop that periodically dumps pending cap flush state. The underlying wait semantics remain intact: callers still wait until the requested cap flushes complete. The difference is that long stalls now produce actionable diagnostics instead of looking like a silent hang. CEPH_CAP_FLUSH_MAX_DUMP_COUNT bounds the diagnostics in two ways: it limits the number of entries emitted per diagnostic dump, and it limits the number of timed diagnostic dumps before the wait continues silently. READ_ONCE is used for the i_last_cap_flush_ack field, which is read outside the inode lock domain. Add a ci pointer to struct ceph_cap_flush so that the diagnostic dump can identify which inode each pending flush belongs to. The new i_last_cap_flush_ack field tracks the latest acknowledged flush tid per inode for diagnostic correlation. This improves reset-drain observability and is also useful for existing sync and writeback troubleshooting paths. 
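
The bounded-diagnostics behaviour of the wait loop can be illustrated with a minimal userspace sketch. It is an assumption-laden model, not the kernel code: model_waiter, model_wait_timeout, model_dump and MODEL_MAX_DUMP_COUNT are hypothetical stand-ins for the wait queue, wait_event_timeout(), dump_cap_flushes() and CEPH_CAP_FLUSH_MAX_DUMP_COUNT. The point it shows is that the loop keeps waiting until the condition holds, while the number of dumps is capped.

```c
#define MODEL_MAX_DUMP_COUNT 5	/* mirrors CEPH_CAP_FLUSH_MAX_DUMP_COUNT */

/* Hypothetical stand-in for the state the wait loop polls. */
struct model_waiter {
	int timeouts_before_done;	/* timed waits that expire first */
	int dumps;			/* diagnostic dumps emitted so far */
};

/*
 * Stand-in for wait_event_timeout(): returns 0 when the timeout expired
 * with the condition still false, nonzero once the condition holds.
 */
static long model_wait_timeout(struct model_waiter *w)
{
	if (w->timeouts_before_done > 0) {
		w->timeouts_before_done--;
		return 0;
	}
	return 1;
}

/* Stand-in for the diagnostic dump; here it only counts invocations. */
static void model_dump(struct model_waiter *w)
{
	w->dumps++;
}

/* Same loop shape as the patched wait_caps_flush(). */
static void model_wait_caps_flush(struct model_waiter *w)
{
	int i = 0;
	long ret;

	do {
		ret = model_wait_timeout(w);
		/* Dump on timeout, but only for the first few timeouts. */
		if (ret == 0 && i++ < MODEL_MAX_DUMP_COUNT)
			model_dump(w);
	} while (ret == 0);
}
```

With eight simulated timeouts the model emits five dumps and then continues waiting silently until the condition is met; with no timeouts it emits none.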
Signed-off-by: Alex Markuze --- fs/ceph/caps.c | 5 ++++ fs/ceph/inode.c | 1 + fs/ceph/mds_client.c | 56 ++++++++++++++++++++++++++++++++++++++++---- fs/ceph/super.h | 6 +++++ 4 files changed, 64 insertions(+), 4 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index cb9e78b713d9..c40175dd77ae 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info= *ci, =20 spin_lock(&mdsc->cap_dirty_lock); capsnap->cap_flush.tid =3D ++mdsc->last_cap_flush_tid; + capsnap->cap_flush.ci =3D ci; list_add_tail(&capsnap->cap_flush.g_list, &mdsc->cap_flush_list); if (oldest_flush_tid =3D=3D 0) @@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void) return NULL; =20 cf->is_capsnap =3D false; + cf->ci =3D NULL; return cf; } =20 @@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode, doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode)); =20 swap(cf, ci->i_prealloc_cap_flush); + cf->ci =3D ci; cf->caps =3D flushing; cf->wake =3D wake; =20 @@ -3826,6 +3829,8 @@ static void handle_cap_flush_ack(struct inode *inode,= u64 flush_tid, bool wake_ci =3D false; bool wake_mdsc =3D false; =20 + WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid); + list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) { /* Is this the one that was flushed? 
*/ if (cf->tid =3D=3D flush_tid) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index f75d66760d54..de465c7e96e8 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -670,6 +670,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb) INIT_LIST_HEAD(&ci->i_cap_snaps); ci->i_head_snapc =3D NULL; ci->i_snap_caps =3D 0; + ci->i_last_cap_flush_ack =3D 0; =20 ci->i_last_rd =3D ci->i_last_wr =3D jiffies - 3600 * HZ; for (i =3D 0; i < CEPH_FILE_MODE_BITS; i++) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index b14ede808436..7d17332d72d7 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -27,6 +27,8 @@ #include =20 #define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE) +#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60 +#define CEPH_CAP_FLUSH_MAX_DUMP_COUNT 5 =20 /* * A cluster of MDS (metadata server) daemons is responsible for @@ -2286,19 +2288,65 @@ static int check_caps_flush(struct ceph_mds_client = *mdsc, } =20 /* - * flush all dirty inode data to disk. + * Dump pending cap flushes for diagnostic purposes. * - * returns true if we've flushed through want_flush_tid + * cf->ci is safe to dereference here because the cap_dirty_lock is + * held, and cap_flush entries are removed from the global + * cap_flush_list under the same lock in the purge path. + */ +static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid) +{ + struct ceph_client *cl =3D mdsc->fsc->client; + struct ceph_cap_flush *cf; + int dumped =3D 0; + + pr_info_client(cl, "still waiting for cap flushes through %llu:\n", + want_tid); + spin_lock(&mdsc->cap_dirty_lock); + list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) { + if (cf->tid > want_tid) + break; + if (++dumped > CEPH_CAP_FLUSH_MAX_DUMP_COUNT) + break; + if (!cf->ci) { + pr_info_client(cl, + "(null ci) %s tid=3D%llu wake=3D%d%s\n", + ceph_cap_string(cf->caps), cf->tid, + cf->wake, + cf->is_capsnap ? 
" is_capsnap" : ""); + continue; + } + pr_info_client(cl, + "%llx:%llx %s tid=3D%llu last_ack=3D%llu wake=3D%d%s\n", + ceph_vinop(&cf->ci->netfs.inode), + ceph_cap_string(cf->caps), cf->tid, + READ_ONCE(cf->ci->i_last_cap_flush_ack), + cf->wake, + cf->is_capsnap ? " is_capsnap" : ""); + } + spin_unlock(&mdsc->cap_dirty_lock); +} + +/* + * Wait for all cap flushes through @want_flush_tid to complete. + * Periodically dumps pending cap flush state for diagnostics. */ static void wait_caps_flush(struct ceph_mds_client *mdsc, u64 want_flush_tid) { struct ceph_client *cl =3D mdsc->fsc->client; + int i =3D 0; + long ret; =20 doutc(cl, "want %llu\n", want_flush_tid); =20 - wait_event(mdsc->cap_flushing_wq, - check_caps_flush(mdsc, want_flush_tid)); + do { + ret =3D wait_event_timeout(mdsc->cap_flushing_wq, + check_caps_flush(mdsc, want_flush_tid), + CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ); + if (ret =3D=3D 0 && i++ < CEPH_CAP_FLUSH_MAX_DUMP_COUNT) + dump_cap_flushes(mdsc, want_flush_tid); + } while (ret =3D=3D 0); =20 doutc(cl, "ok, flushed thru %llu\n", want_flush_tid); } diff --git a/fs/ceph/super.h b/fs/ceph/super.h index c89ad8dcc969..1f901b1647e6 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -238,6 +238,7 @@ struct ceph_cap_flush { bool is_capsnap; /* true means capsnap */ struct list_head g_list; // global struct list_head i_list; // per inode + struct ceph_inode_info *ci; }; =20 /* @@ -443,6 +444,11 @@ struct ceph_inode_info { struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or dirty|flushing caps */ unsigned i_snap_caps; /* cap bits for snapped files */ + /* + * Written under i_ceph_lock, read via READ_ONCE() + * from diagnostic paths. 
+ */ + u64 i_last_cap_flush_ack; =20 unsigned long i_last_rd; unsigned long i_last_wr; --=20 2.34.1 From nobody Sat Jun 20 11:50:30 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E80863E5EEA for ; Wed, 15 Apr 2026 17:01:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776272509; cv=none; b=iEr+Efql6G2KqDzMHmvAFBzRHa72ItUthwYc+q3EFfKqk3nH6e8CiYV05Fe/roInBfhQUNrQJO2LjyhSVQ9mhFZCMrZxlr3UKDsHQsyfQ/Hu2bb/d0BoMIHVCKWIJ2GIO1rHdOAOdcJeAcRokSBh6F6fnP+QLUHkRAkZYWUicA0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776272509; c=relaxed/simple; bh=nC4dvPOz/UMoWDa230ND3c0qOR1XY9s4Iu3fbaIELHk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=NDITNcM0wUGoiFqYkPxYJi5RRxmmEBJdrJAJDl+RcaT2ySvOm23T6s0ZDx7lOklFCyENsMqenAohEisyD30o947j66mYd0ApWevTNbJridbG3vFAH6+z3jRjL6HUEa/yu1DBNF609koRtmc9Bg9TB53A/aDChVquTgKVn9V7QTk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=gIiwmfRV; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b=bM5eJI3X; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="gIiwmfRV"; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b="bM5eJI3X" 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1776272505; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=FKxGecMlOZMPAG4oj3+9ulNPv3v8gJrZqzCyY+cN5us=; b=gIiwmfRVRzyKrwgIV3hx7lqHxR80eaAMzuJqLqcABc0kzweSr0/ECmwDNgJYKYZp0xCoz/ 7fMnTZp03QDknl37Nb0zLfTj+xU3VER6joEfYbX/MozLLmR9xpOvdnDiOkDZONdn84HqY4 9EQ+R836decR5rc61aG0Kp/G7JU7XeA= Received: from mail-qt1-f200.google.com (mail-qt1-f200.google.com [209.85.160.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-692-y5-XxvLiNmm5foY2EKsQNg-1; Wed, 15 Apr 2026 13:01:39 -0400 X-MC-Unique: y5-XxvLiNmm5foY2EKsQNg-1 X-Mimecast-MFC-AGG-ID: y5-XxvLiNmm5foY2EKsQNg_1776272499 Received: by mail-qt1-f200.google.com with SMTP id d75a77b69052e-50d8e4c29caso177086071cf.0 for ; Wed, 15 Apr 2026 10:01:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=google; t=1776272499; x=1776877299; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=FKxGecMlOZMPAG4oj3+9ulNPv3v8gJrZqzCyY+cN5us=; b=bM5eJI3X0jjHsDYR93XUwkOIF94fUtdpvpd4DAvoSHHyFO3cmY77BVws2V4kQZKbZ/ k7sl9RHXFeIO+Rkm7KF6FNx/ROZf2L08LEY3IoO/vRhML1RWyd2kqOUn9r+bMssGQWK8 nU3lml279iITHvqjGksNvYOYPWjZ9dZElZRg3u6zqchSrRM2w7NGRQLm3RL+vhs8u5pa 4qKxo6GfcYo96FMss5ay6+haXiceqYimyz2fYvhxlUBaLRRKVSFa520y6F9uFUpWJuUL kHjCnVRrTLJnXjzrb30IUZ4SX4z9jaBRRvP/PTBMtk8VRw4Z5TlRNy2Uk35Sv96Ja8mf ibwA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1776272499; x=1776877299; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; 
bh=FKxGecMlOZMPAG4oj3+9ulNPv3v8gJrZqzCyY+cN5us=; b=LzjG3ADP07Q7y3daT1BivRBfc1fyZcgZ7VfiptguN7DPYi7+1q+sMAoWXOtnCmwyiw KpLHDdG7LNzZBNHE2uKbZf9TW6kIQB/IIkcWeLR0WNkELXbteFDcusBfncPIdVbB2cY0 b8vEkM/ixI2e8KxfXXO2PqcjQbRuWo/Oi8PnOrhMl/nRI3E7ulItedNMu/z/erAmfGmn nCZu6KyFyYBJsYRxRJ0D38lceg+g2injvwWTG3MmPyoLoX4+JX96PMbATTwVU596oKmI tyV8B+C4my9gipUeQMEDhtn1tFxVo4tiim8nolo3FZM0s4qgNqwcCt4EuJpfLPQZcwbt Hdcw== X-Gm-Message-State: AOJu0Yyy7mc1hfzws7OzShQbjrhQkt4KKLRqAplz9BkWucglKvVMZ+7e r0UJovVDLS7qtOn/BH+u3KNRdDzw/c65AQLUzVTU4KUjJCMlxwUSaE8NpotaWm5QhkWVjnjK5zg ag52dEFZFrjNM6WolW0+pL1DttmvTv50fWV4Ppzm4xsRL8/lxa4QH0q/pTgrGC4GDxg== X-Gm-Gg: AeBDiet39WymquYg1uZ7siC7fnqHLYzH3qKU+xo4mw9TRG8qNYg50gz7k3GnuUH5Sup 3rHTL1AIr3oZDDqUsIlRHoa8WZd+q2Mxr/2ejQndpRVFDQFYIZ8BeM7Wnu6R+xBD5Ll7UXtJZvD FbE05W1lqw4ZZJV6bFtdnqYZAEW70osUkafTgYweX/WmWXjV6Z4Daqebn8Def3Bn7pxfr5CtZUA YcMSi5tyrIGVwwsdC/DoQ0jUkDs6DHIHjDhuIk9wp9tkuuSeaFwDrXNios1n64xcSyKZFJ/xZfg 5BtXQyWjkJzf4eoudE/WWZIoC4HeB0kLpkBqbPH8I8sfyVm2wC7cH2s1CPKsLIn/BZz2zhOTRdu T+2hkJ3ajZQSdRkJ6/enCmzdyZazamchjzhis+U8tX6IlWjsHTVBzOfwE3b7WtlSjHg== X-Received: by 2002:a05:622a:1a8e:b0:50d:7b0c:35e7 with SMTP id d75a77b69052e-50dd5c6cd3amr330619191cf.43.1776272491969; Wed, 15 Apr 2026 10:01:31 -0700 (PDT) X-Received: by 2002:a05:622a:1a8e:b0:50d:7b0c:35e7 with SMTP id d75a77b69052e-50dd5c6cd3amr330599741cf.43.1776272472957; Wed, 15 Apr 2026 10:01:12 -0700 (PDT) Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com. 
[13.121.85.79]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Apr 2026 10:01:12 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v2 5/7] ceph: add client reset state machine and session teardown Date: Wed, 15 Apr 2026 17:00:41 +0000 Message-Id: <20260415170043.3882912-6-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com> References: <20260415170043.3882912-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add the client-side reset state machine, request gating, and manual session teardown implementation. Manual reset is an operator-triggered escape hatch for client/MDS stalemates in which caps, locks, or unsafe metadata state stop making forward progress. The reset blocks new metadata work, attempts a bounded best-effort drain of dirty client state while sessions are still alive, and finally asks the MDS to close sessions before tearing local session state down directly. The reset state machine tracks four phases: IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by schedule_reset() before the workqueue item is dispatched, so that new metadata requests and file-lock acquisitions are gated immediately -- even before the work function begins running. All non-IDLE phases block callers on blocked_wq, preventing races with session teardown. The drain phase flushes mdlog state, dirty caps, and pending cap releases for a bounded interval. 
State that still cannot make progress within that interval is discarded during teardown, which is the point of the reset: break the stalemate and allow fresh sessions to rebuild clean state. The session teardown follows the established check_new_map() forced-close pattern: unregister sessions under mdsc->mutex, then clean up caps and requests under s->s_mutex. Reconnect is not attempted because the MDS only accepts reconnects during its own RECONNECT phase after restart, not from an active client. Blocked callers are released when reset completes and observe the final result. The destroy path marks reset as failed and wakes blocked waiters before cancel_work_sync() so unmount does not stall. Signed-off-by: Alex Markuze --- fs/ceph/locks.c | 16 ++ fs/ceph/mds_client.c | 421 +++++++++++++++++++++++++++++++++++++++++++ fs/ceph/mds_client.h | 42 +++++ 3 files changed, 479 insertions(+) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index c4ff2266bb94..677221bd64e0 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_l= ock *fl) { struct inode *inode =3D file_inode(file); struct ceph_inode_info *ci =3D ceph_inode(inode); + struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb); struct ceph_client *cl =3D ceph_inode_to_client(inode); int err =3D 0; u16 op =3D CEPH_MDS_OP_SETFILELOCK; @@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_= lock *fl) return -EIO; } =20 + /* Wait for reset to complete before acquiring new locks */ + if (op =3D=3D CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) { + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) + return err; + } + if (lock_is_read(fl)) lock_cmd =3D CEPH_LOCK_SHARED; else if (lock_is_write(fl)) @@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_= lock *fl) { struct inode *inode =3D file_inode(file); struct ceph_inode_info *ci =3D ceph_inode(inode); + struct ceph_mds_client *mdsc =3D 
ceph_sb_to_mdsc(inode->i_sb); struct ceph_client *cl =3D ceph_inode_to_client(inode); int err =3D 0; u8 wait =3D 0; @@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file= _lock *fl) return -EIO; } =20 + /* Wait for reset to complete before acquiring new locks */ + if (!lock_is_unlock(fl)) { + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) + return err; + } + if (IS_SETLKW(cmd)) wait =3D 1; =20 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 7d17332d72d7..7e399b0dcc55 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -67,6 +68,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc, struct list_head *head); static void ceph_cap_release_work(struct work_struct *work); static void ceph_cap_reclaim_work(struct work_struct *work); +static void ceph_mdsc_reset_workfn(struct work_struct *work); =20 static const struct ceph_connection_operations mds_con_ops; =20 @@ -3756,6 +3758,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client = *mdsc, struct inode *dir, struct ceph_client *cl =3D mdsc->fsc->client; int err =3D 0; =20 + /* + * If a reset is in progress, wait for it to complete. + * + * This is best-effort: a request can pass this check just + * before the phase leaves IDLE and proceed concurrently with + * reset. That is acceptable because (a) such requests will + * either complete normally or fail and be retried by the + * caller, and (b) adding lock serialization here would + * penalize every request for a rare manual operation. 
+ */ + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) { + doutc(cl, "wait_for_reset failed: %d\n", err); + return err; + } + /* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */ if (req->r_inode) ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN); @@ -5163,6 +5181,387 @@ static int send_mds_reconnect(struct ceph_mds_clien= t *mdsc, return err; } =20 +const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase) +{ + switch (phase) { + case CEPH_CLIENT_RESET_IDLE: return "idle"; + case CEPH_CLIENT_RESET_QUIESCING: return "quiescing"; + case CEPH_CLIENT_RESET_DRAINING: return "draining"; + case CEPH_CLIENT_RESET_TEARDOWN: return "teardown"; + default: return "unknown"; + } +} + +/* + * Wait for an active reset to complete. + * Returns 0 if reset completed successfully or no reset was active. + * Returns -ETIMEDOUT if we timed out waiting. + * Returns -ERESTARTSYS if interrupted by signal. + */ +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_client *cl =3D mdsc->fsc->client; + unsigned long deadline =3D jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC *= HZ; + int blocked_count; + long wait_ret; + int ret; + + if (READ_ONCE(st->phase) =3D=3D CEPH_CLIENT_RESET_IDLE) + return 0; + + blocked_count =3D atomic_inc_return(&st->blocked_requests); + doutc(cl, "request blocked during reset, %d total blocked\n", + blocked_count); + +retry: + wait_ret =3D wait_event_interruptible_timeout(st->blocked_wq, + READ_ONCE(st->phase) =3D=3D + CEPH_CLIENT_RESET_IDLE, + deadline - jiffies); + + if (wait_ret =3D=3D 0) { + atomic_dec(&st->blocked_requests); + pr_warn_client(cl, "timed out waiting for reset to complete\n"); + return -ETIMEDOUT; + } + if (wait_ret < 0) { + atomic_dec(&st->blocked_requests); + return (int)wait_ret; /* -ERESTARTSYS */ + } + + /* + * Verify phase is still IDLE under the lock. 
If another reset + * was scheduled between the wake-up and this check, loop back + * and wait for it to finish rather than returning a stale result. + */ + spin_lock(&st->lock); + if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) { + spin_unlock(&st->lock); + if (time_before(jiffies, deadline)) + goto retry; + atomic_dec(&st->blocked_requests); + return -ETIMEDOUT; + } + ret =3D st->last_errno; + spin_unlock(&st->lock); + + atomic_dec(&st->blocked_requests); + return ret; +} + +static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + + spin_lock(&st->lock); + /* + * If destroy already marked us as shut down, it owns the + * final bookkeeping. Just bail so we don't overwrite the + * -ESHUTDOWN result that waiters already observed. + */ + if (st->shutdown) { + spin_unlock(&st->lock); + return; + } + st->last_finish =3D jiffies; + st->last_errno =3D ret; + st->phase =3D CEPH_CLIENT_RESET_IDLE; + if (ret) + st->failure_count++; + else + st->success_count++; + spin_unlock(&st->lock); + + /* Wake up all requests that were blocked waiting for reset */ + wake_up_all(&st->blocked_wq); +} + +static void ceph_mdsc_reset_workfn(struct work_struct *work) +{ + struct ceph_mds_client *mdsc =3D + container_of(work, struct ceph_mds_client, reset_work); + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_client *cl =3D mdsc->fsc->client; + struct ceph_mds_session **sessions =3D NULL; + char reason[CEPH_CLIENT_RESET_REASON_LEN]; + int max_sessions, i, n =3D 0, torn_down =3D 0; + int ret =3D 0; + + spin_lock(&st->lock); + strscpy(reason, st->last_reason, sizeof(reason)); + spin_unlock(&st->lock); + + mutex_lock(&mdsc->mutex); + max_sessions =3D mdsc->max_sessions; + if (max_sessions <=3D 0) { + mutex_unlock(&mdsc->mutex); + goto out_complete; + } + + sessions =3D kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL); + if (!sessions) { + mutex_unlock(&mdsc->mutex); + ret =3D -ENOMEM; + 
pr_err_client(cl, + "manual session reset failed to allocate session array\n"); + ceph_mdsc_reset_complete(mdsc, ret); + return; + } + + for (i =3D 0; i < max_sessions; i++) { + struct ceph_mds_session *session =3D mdsc->sessions[i]; + + if (!session) + continue; + + /* + * Read session state without s_mutex to avoid nesting + * mdsc->mutex -> s_mutex, which would invert the + * s_mutex -> mdsc->mutex order used by + * cleanup_session_requests(). s_state is an int + * so loads are atomic; the teardown loop below + * handles races with concurrent state transitions. + */ + switch (READ_ONCE(session->s_state)) { + case CEPH_MDS_SESSION_OPEN: + case CEPH_MDS_SESSION_HUNG: + case CEPH_MDS_SESSION_OPENING: + case CEPH_MDS_SESSION_RESTARTING: + case CEPH_MDS_SESSION_RECONNECTING: + case CEPH_MDS_SESSION_CLOSING: + sessions[n++] =3D ceph_get_mds_session(session); + break; + default: + pr_info_client(cl, + "mds%d in state %s, skipping reset\n", + session->s_mds, + ceph_session_state_name(session->s_state)); + break; + } + } + mutex_unlock(&mdsc->mutex); + + pr_info_client(cl, + "manual session reset executing (sessions=3D%d, reason=3D\"%s\")\= n", + n, reason); + + if (n =3D=3D 0) { + kfree(sessions); + goto out_complete; + } + + spin_lock(&st->lock); + st->phase =3D CEPH_CLIENT_RESET_DRAINING; + spin_unlock(&st->lock); + + /* + * Best-effort drain: flush dirty state while sessions are still + * alive. New requests are blocked while phase !=3D IDLE. + * The sessions are functional, so non-stuck state drains normally. + * Stuck state (the cause of the stalemate the operator is trying + * to break) will not drain - that is expected, and we proceed to + * forced teardown after the timeout. + * + * Three things are drained: + * 1. MDS journal - send_flush_mdlog asks each MDS to journal + * pending unsafe operations (creates, renames, setattrs). + * Once journaled, they survive the session teardown. + * 2. 
Dirty caps - ceph_flush_dirty_caps triggers cap flush on + * all sessions. Non-stuck caps flush in milliseconds. + * 3. Cap releases - push pending cap release messages. + * + * All three happen concurrently during the bounded wait window. + */ + for (i =3D 0; i < n; i++) + send_flush_mdlog(sessions[i]); + + ceph_flush_dirty_caps(mdsc); + ceph_flush_cap_releases(mdsc); + + spin_lock(&mdsc->cap_dirty_lock); + if (!list_empty(&mdsc->cap_flush_list)) { + struct ceph_cap_flush *cf =3D + list_last_entry(&mdsc->cap_flush_list, + struct ceph_cap_flush, g_list); + u64 want_flush =3D mdsc->last_cap_flush_tid; + long drain_ret; + + /* + * Setting wake on the last entry is sufficient: flush + * entries complete in order, so when this entry finishes + * all earlier ones are already done. + */ + cf->wake =3D true; + spin_unlock(&mdsc->cap_dirty_lock); + pr_info_client(cl, + "draining (want_flush=3D%llu, %d sessions)\n", + want_flush, n); + drain_ret =3D wait_event_timeout(mdsc->cap_flushing_wq, + check_caps_flush(mdsc, + want_flush), + CEPH_CLIENT_RESET_DRAIN_SEC * HZ); + if (drain_ret =3D=3D 0) { + pr_info_client(cl, + "drain timed out, proceeding with forced teardown\n"); + spin_lock(&st->lock); + st->drain_timed_out =3D true; + spin_unlock(&st->lock); + } else { + pr_info_client(cl, "drain completed successfully\n"); + spin_lock(&st->lock); + st->drain_timed_out =3D false; + spin_unlock(&st->lock); + } + } else { + spin_unlock(&mdsc->cap_dirty_lock); + spin_lock(&st->lock); + st->drain_timed_out =3D false; + spin_unlock(&st->lock); + } + + spin_lock(&st->lock); + st->phase =3D CEPH_CLIENT_RESET_TEARDOWN; + spin_unlock(&st->lock); + + /* + * Ask each MDS to close the session before we tear it down + * locally. Without this the MDS sees only a connection drop and + * waits for the client to reconnect (up to session_autoclose + * seconds) before evicting the session and releasing locks. 
+ * + * Reuse the normal close machinery so the session state/sequence + * snapshot is serialized under s_mutex and a racing s_seq bump + * retransmits REQUEST_CLOSE while the session remains CLOSING. + * We send all close requests first, then yield briefly to let the + * network stack transmit them before __unregister_session() + * closes the connections. + */ + for (i =3D 0; i < n; i++) { + int err; + + mutex_lock(&sessions[i]->s_mutex); + err =3D __close_session(mdsc, sessions[i]); + mutex_unlock(&sessions[i]->s_mutex); + if (err < 0) + pr_warn_client(cl, + "mds%d failed to queue close request before reset: %d\n", + sessions[i]->s_mds, err); + } + if (n > 0) + msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS); + + /* + * Tear down each session: close the connection, remove all + * caps, clean up requests, then kick pending requests so they + * re-open a fresh session on the next attempt. + * + * This is modeled on the check_new_map() forced-close path + * for stopped MDS ranks - a proven pattern for hard session + * teardown. We do NOT attempt send_mds_reconnect() because + * the MDS only accepts reconnects during its own RECONNECT + * phase (after MDS restart), not from an active client. + * + * Any state that did not drain (caps that didn't flush, unsafe + * requests that the MDS didn't journal) is force-dropped here. + * This is intentional: that state is stuck and is the reason + * the operator triggered the reset. 
+ */ + for (i =3D 0; i < n; i++) { + int mds =3D sessions[i]->s_mds; + + pr_info_client(cl, "mds%d resetting session\n", mds); + + mutex_lock(&mdsc->mutex); + if (mds >=3D mdsc->max_sessions || + mdsc->sessions[mds] !=3D sessions[i]) { + pr_info_client(cl, + "mds%d session already torn down, skipping\n", + mds); + mutex_unlock(&mdsc->mutex); + ceph_put_mds_session(sessions[i]); + continue; + } + sessions[i]->s_state =3D CEPH_MDS_SESSION_CLOSED; + __unregister_session(mdsc, sessions[i]); + __wake_requests(mdsc, &sessions[i]->s_waiting); + mutex_unlock(&mdsc->mutex); + + mutex_lock(&sessions[i]->s_mutex); + cleanup_session_requests(mdsc, sessions[i]); + remove_session_caps(sessions[i]); + mutex_unlock(&sessions[i]->s_mutex); + + wake_up_all(&mdsc->session_close_wq); + + ceph_put_mds_session(sessions[i]); + + mutex_lock(&mdsc->mutex); + kick_requests(mdsc, mds); + mutex_unlock(&mdsc->mutex); + + torn_down++; + pr_info_client(cl, "mds%d session reset complete\n", mds); + } + + kfree(sessions); + + spin_lock(&st->lock); + st->sessions_reset =3D torn_down; + spin_unlock(&st->lock); + +out_complete: + ceph_mdsc_reset_complete(mdsc, ret); +} + +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc, + const char *reason) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_fs_client *fsc =3D mdsc->fsc; + const char *msg =3D (reason && reason[0]) ? 
reason : "manual"; + int mount_state; + + mount_state =3D READ_ONCE(fsc->mount_state); + if (mount_state !=3D CEPH_MOUNT_MOUNTED) { + pr_warn_client(fsc->client, + "reset rejected: mount_state=3D%d (not mounted)\n", + mount_state); + return -EINVAL; + } + + spin_lock(&st->lock); + if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) { + spin_unlock(&st->lock); + return -EBUSY; + } + + st->phase =3D CEPH_CLIENT_RESET_QUIESCING; + st->last_start =3D jiffies; + st->last_errno =3D 0; + st->drain_timed_out =3D false; + st->sessions_reset =3D 0; + st->trigger_count++; + strscpy(st->last_reason, msg, sizeof(st->last_reason)); + spin_unlock(&st->lock); + + if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) { + spin_lock(&st->lock); + st->phase =3D CEPH_CLIENT_RESET_IDLE; + st->last_errno =3D -EALREADY; + st->last_finish =3D jiffies; + st->failure_count++; + spin_unlock(&st->lock); + wake_up_all(&st->blocked_wq); + return -EALREADY; + } + + pr_info_client(mdsc->fsc->client, + "manual session reset scheduled (reason=3D\"%s\")\n", + msg); + return 0; +} + =20 /* * compare old and new mdsmaps, kicking requests @@ -5702,6 +6101,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc) INIT_LIST_HEAD(&mdsc->dentry_leases); INIT_LIST_HEAD(&mdsc->dentry_dir_leases); =20 + spin_lock_init(&mdsc->reset_state.lock); + init_waitqueue_head(&mdsc->reset_state.blocked_wq); + atomic_set(&mdsc->reset_state.blocked_requests, 0); + INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn); + ceph_caps_init(mdsc); ceph_adjust_caps_max_min(mdsc, fsc->mount_options); =20 @@ -6227,6 +6631,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc) /* flush out any connection work with references to us */ ceph_msgr_flush(); =20 + /* + * Mark reset as failed and wake any blocked waiters before + * cancelling, so unmount doesn't stall on blocked_wq timeout + * if cancel_work_sync() prevents the work from running. 
+ */ + spin_lock(&mdsc->reset_state.lock); + mdsc->reset_state.shutdown =3D true; + if (mdsc->reset_state.phase !=3D CEPH_CLIENT_RESET_IDLE) { + mdsc->reset_state.phase =3D CEPH_CLIENT_RESET_IDLE; + mdsc->reset_state.last_errno =3D -ESHUTDOWN; + mdsc->reset_state.last_finish =3D jiffies; + mdsc->reset_state.failure_count++; + } + spin_unlock(&mdsc->reset_state.lock); + wake_up_all(&mdsc->reset_state.blocked_wq); + + cancel_work_sync(&mdsc->reset_work); ceph_mdsc_stop(mdsc); =20 ceph_metric_destroy(&mdsc->metric); diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h index e91a199d56fd..afc08b0abbd5 100644 --- a/fs/ceph/mds_client.h +++ b/fs/ceph/mds_client.h @@ -74,6 +74,42 @@ struct ceph_fs_client; struct ceph_cap; =20 #define MDS_AUTH_UID_ANY -1 +#define CEPH_CLIENT_RESET_REASON_LEN 64 +#define CEPH_CLIENT_RESET_DRAIN_SEC 5 +#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100 +#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120 + +enum ceph_client_reset_phase { + CEPH_CLIENT_RESET_IDLE =3D 0, + /* + * QUIESCING is set synchronously by schedule_reset() before the + * workqueue item is dispatched. It gates new requests (any + * phase !=3D IDLE blocks callers) during the window between + * scheduling and the work function's transition to DRAINING. 
+ */ + CEPH_CLIENT_RESET_QUIESCING, + CEPH_CLIENT_RESET_DRAINING, + CEPH_CLIENT_RESET_TEARDOWN, +}; + +struct ceph_client_reset_state { + spinlock_t lock; + u64 trigger_count; + u64 success_count; + u64 failure_count; + unsigned long last_start; + unsigned long last_finish; + int last_errno; + enum ceph_client_reset_phase phase; + bool drain_timed_out; + bool shutdown; + int sessions_reset; + char last_reason[CEPH_CLIENT_RESET_REASON_LEN]; + + /* Request blocking during reset */ + wait_queue_head_t blocked_wq; + atomic_t blocked_requests; +}; =20 struct ceph_mds_cap_match { s64 uid; /* default to MDS_AUTH_UID_ANY */ @@ -536,6 +572,8 @@ struct ceph_mds_client { struct list_head dentry_dir_leases; /* lru list */ =20 struct ceph_client_metric metric; + struct work_struct reset_work; + struct ceph_client_reset_state reset_state; =20 spinlock_t snapid_map_lock; struct rb_root snapid_map_tree; @@ -559,10 +597,14 @@ extern struct ceph_mds_session * __ceph_lookup_mds_session(struct ceph_mds_client *, int mds); =20 extern const char *ceph_session_state_name(int s); +extern const char *ceph_reset_phase_name(enum ceph_client_reset_phase phas= e); =20 extern struct ceph_mds_session * ceph_get_mds_session(struct ceph_mds_session *s); extern void ceph_put_mds_session(struct ceph_mds_session *s); +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc, + const char *reason); +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc); =20 extern int ceph_mdsc_init(struct ceph_fs_client *fsc); extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc); --=20 2.34.1 From nobody Sat Jun 20 11:50:30 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 58F843E6DD8 for ; Wed, 15 Apr 2026 17:01:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776272519; cv=none; b=kprYWWAFqN+BQSoZU64ZhRHXNpKT3jPRWR+kCoeT+n8Zbg1nuTZYjhKN/9lSUCj2ZkaIQyJ7NXXKpTTkOQiqn49LJGlIKF5KQpwAk/xdgTfg6aXZwyGoMTAKbUXZsUqwrAk8fOVx6+qnR9KEp2+ZdumzoHjUfzPAOe3CJvDJ5dE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776272519; c=relaxed/simple; bh=hF8GPPMCTtYMk3vi8Lxz1SJG7tR0daTJyKi+xKmaWuE=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=sPO/Sz6JM8e7Mo1Onp6WC1t6jPyzH7jP20zYXo0yeot0JZqpl6gp0nmup9Nekqbt1ulksgpuDTfbXr1fn1Z+gl7gNjtaMXwTsN4Arb8DwICAEK+JYzTOvbGhdBWeAglr5qX8LxPCMAAaf76dWgulB2/EVfE6qYADaJyS4fRVQIM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=TdW+ar5v; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b=I3tYKJ+G; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="TdW+ar5v"; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b="I3tYKJ+G" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1776272516; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=//5vy/hAvLLk2tz3ORfdr9CS0acwofxbtDJCElDQmN4=; b=TdW+ar5vzCrQjsoU7U9J6eijXXi19KNhAhqgCaKkFo+dpjngg1ekLpHsYR6d3Vn925yEJH Tx7mX0hWwWp19goKvEvNtQP0At5ECVk+v4jVZ7vb0ADPy6ouHTPUYvMIa6ytQRmjKRw2T/ 
g1mYhwEUqJX2YNr23w2DMAot/WZfX5E= Received: from mail-qt1-f197.google.com (mail-qt1-f197.google.com [209.85.160.197]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-676-5qGr1ZYVNQSZTMmQtnYe_Q-1; Wed, 15 Apr 2026 13:01:55 -0400 X-MC-Unique: 5qGr1ZYVNQSZTMmQtnYe_Q-1 X-Mimecast-MFC-AGG-ID: 5qGr1ZYVNQSZTMmQtnYe_Q_1776272515 Received: by mail-qt1-f197.google.com with SMTP id d75a77b69052e-50d63962d83so167608931cf.2 for ; Wed, 15 Apr 2026 10:01:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=google; t=1776272515; x=1776877315; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=//5vy/hAvLLk2tz3ORfdr9CS0acwofxbtDJCElDQmN4=; b=I3tYKJ+GEzydugY/VNSCeqgJLjGY8WttVEuEIt8RpD3tDlrkYgTn+y02eWb0/mBWMZ Y0HGzfPUdkmrt+ir9PckYRezHRet8VSMWW/YuSpFuAXWtAOrPD2M3Qk6tKhrgc6ZmdNI pfSBuDcZFXSf0o81lOuHe+S0X0rEjnCobq1NZaMJReMDV0I8OW0WJlkLTHhz78rVI6y4 e32btVtnFdZQQ+inXP/EVCYlEeWJU9PAfonXcxSJ72eru/zGln0ortBdeFuKYJwOUaAG S0f2kkFEkUrIsZEF6WZ0Jt5QbNEVWDXeLYIQnHMIOjUPBgh1IJhkYQWNKoZUDrWqtAea x4tA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1776272515; x=1776877315; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=//5vy/hAvLLk2tz3ORfdr9CS0acwofxbtDJCElDQmN4=; b=iEGVXdXyjwYrIXQonio/HXaq3oLfen0aFJ6PeaQXtaa1T37zfPbG/yZSJuv5fWBLWS 6eaQvruhbpSmpFd1no98k3tyTonb5pIitRouJx/6y+WikLZna/hOWa/kSiTZvxgYiLbU cpr/AtSN8D53Za4BZam1ZUj/4z/H/lFsJpg+8BEmp/mSI2S7/83GUA3Cm88V4sCg2Meg psqVxoAc7b5BUX92+4iNHnBMXXkU5JzjSbpypxp2jVgEYsb4+rZvzB/1wYAQFNyrIGyD 8CkSTZoQm3fBC/7d4q6yM1z7CV0JSoWg63zWd5hpcgFir9v6pzchZp4uPyoO05unNj1J 3AxA== X-Gm-Message-State: AOJu0YyPN3w40stg88x5XoGdJJyolyG4g3VstB0rUXVFiuRmfmC4kB0g 
awPD7Dgk6jVxDKEkfACMJWHCEO0R2qABD8dLrVB04GinBBWlufveWyyssGEqqjICF8Arttt9NWQ Q9FFRRw7SFtbFpr/pgfgj2Ejjj0ZDsLfz3aZO5163BmzfbMwPAHAx6XzZnJf9bh9ijA== X-Gm-Gg: AeBDiesh2UQbQaZIwCXovx7gDfmkGE4rPVLwNS7+VYtFZ5AUj3DM7v0xvEVY0Ahrqbf uzcqY0q3KHXzVwgRI8bDeZIbYSS0gdZWU3JzrSc/ub//ohJhEPUm3ZqIu23OEdEe5SxPuiCepkr LGoxYiS7iyC1gZBwI/DU9ax7fbOBi268eJm0EdC972ZhbzVcliv/CCpsHZRMjQCL619zRxLoUpA SHt9XB74O48G7bntHFbzjxrTt33sWNQ+UFRMqe6OQT5x328JP22ovMWpXKNZdt7xVmYpqqQdmoW +I1+1PXsjwSEshnVrTPlQfQlgV/gnW7INaw7wj0ST+P37jeu91MuK9RTAneaj+B57BZ3yPjrNRx rajSqli6doR65eigVekdSjgCsG0r/Meq0qhupBNMvEgL+iS+A0wEDpucfCJGktj7zVg== X-Received: by 2002:a05:622a:9:b0:50d:7b0c:35de with SMTP id d75a77b69052e-50dd5c74756mr343034801cf.44.1776272506493; Wed, 15 Apr 2026 10:01:46 -0700 (PDT) X-Received: by 2002:a05:622a:9:b0:50d:7b0c:35de with SMTP id d75a77b69052e-50dd5c74756mr343006191cf.44.1776272475636; Wed, 15 Apr 2026 10:01:15 -0700 (PDT) Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com. [13.121.85.79]) by smtp.gmail.com with ESMTPSA id d75a77b69052e-50e1af9dc5fsm16817841cf.16.2026.04.15.10.01.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Apr 2026 10:01:15 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v2 6/7] ceph: add manual reset debugfs control and tracepoints Date: Wed, 15 Apr 2026 17:00:42 +0000 Message-Id: <20260415170043.3882912-7-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com> References: <20260415170043.3882912-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add the debugfs and trace plumbing used to trigger and observe manual client reset. 
The reset interface exposes a trigger file for operator-initiated reset and a status file for tracking the most recent run. The tracepoints record scheduling, completion, and blocked caller behavior so reset progress can be diagnosed from the client side. debugfs layout under /sys/kernel/debug/ceph//reset/: trigger - write to initiate a manual reset status - read to see the most recent reset result Tracepoints: ceph_client_reset_schedule - reset queued ceph_client_reset_complete - reset finished (success or failure) ceph_client_reset_blocked - caller blocked waiting for reset ceph_client_reset_unblocked - caller unblocked after reset Signed-off-by: Alex Markuze --- fs/ceph/debugfs.c | 104 ++++++++++++++++++++++++++++++++++++ fs/ceph/mds_client.c | 8 +++ fs/ceph/super.h | 3 ++ include/trace/events/ceph.h | 63 ++++++++++++++++++++++ 4 files changed, 178 insertions(+) diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c index 7dc307790240..d46d41ec7a86 100644 --- a/fs/ceph/debugfs.c +++ b/fs/ceph/debugfs.c @@ -9,6 +9,7 @@ #include #include #include +#include =20 #include #include @@ -360,16 +361,107 @@ static int status_show(struct seq_file *s, void *p) return 0; } =20 +static int reset_status_show(struct seq_file *s, void *p) +{ + struct ceph_fs_client *fsc =3D s->private; + struct ceph_mds_client *mdsc =3D fsc->mdsc; + struct ceph_client_reset_state *st; + u64 trigger =3D 0, success =3D 0, failure =3D 0; + unsigned long last_start =3D 0, last_finish =3D 0; + int last_errno =3D 0; + enum ceph_client_reset_phase phase =3D CEPH_CLIENT_RESET_IDLE; + bool drain_timed_out =3D false; + int sessions_reset =3D 0; + int blocked_requests =3D 0; + char reason[CEPH_CLIENT_RESET_REASON_LEN]; + + if (!mdsc) + return 0; + + st =3D &mdsc->reset_state; + + spin_lock(&st->lock); + trigger =3D st->trigger_count; + success =3D st->success_count; + failure =3D st->failure_count; + last_start =3D st->last_start; + last_finish =3D st->last_finish; + last_errno =3D st->last_errno; + phase =3D 
st->phase; + drain_timed_out =3D st->drain_timed_out; + sessions_reset =3D st->sessions_reset; + strscpy(reason, st->last_reason, sizeof(reason)); + spin_unlock(&st->lock); + + blocked_requests =3D atomic_read(&st->blocked_requests); + + seq_printf(s, "phase: %s\n", ceph_reset_phase_name(phase)); + seq_printf(s, "trigger_count: %llu\n", trigger); + seq_printf(s, "success_count: %llu\n", success); + seq_printf(s, "failure_count: %llu\n", failure); + if (last_start) + seq_printf(s, "last_start_ms_ago: %u\n", + jiffies_to_msecs(jiffies - last_start)); + else + seq_puts(s, "last_start_ms_ago: (never)\n"); + if (last_finish) + seq_printf(s, "last_finish_ms_ago: %u\n", + jiffies_to_msecs(jiffies - last_finish)); + else + seq_puts(s, "last_finish_ms_ago: (never)\n"); + seq_printf(s, "last_errno: %d\n", last_errno); + seq_printf(s, "last_reason: %s\n", + reason[0] ? reason : "(none)"); + seq_printf(s, "drain_timed_out: %s\n", + drain_timed_out ? "yes" : "no"); + seq_printf(s, "sessions_reset: %d\n", sessions_reset); + seq_printf(s, "blocked_requests: %d\n", blocked_requests); + + return 0; +} + +static ssize_t reset_trigger_write(struct file *file, const char __user *b= uf, + size_t len, loff_t *ppos) +{ + struct ceph_fs_client *fsc =3D file->private_data; + struct ceph_mds_client *mdsc =3D fsc->mdsc; + char reason[CEPH_CLIENT_RESET_REASON_LEN]; + size_t copy; + int ret; + + if (!mdsc) + return -ENODEV; + + copy =3D min_t(size_t, len, sizeof(reason) - 1); + if (copy && copy_from_user(reason, buf, copy)) + return -EFAULT; + reason[copy] =3D '\0'; + strim(reason); + + ret =3D ceph_mdsc_schedule_reset(mdsc, reason); + if (ret) + return ret; + + return len; +} + DEFINE_SHOW_ATTRIBUTE(mdsmap); DEFINE_SHOW_ATTRIBUTE(mdsc); DEFINE_SHOW_ATTRIBUTE(caps); DEFINE_SHOW_ATTRIBUTE(mds_sessions); DEFINE_SHOW_ATTRIBUTE(status); +DEFINE_SHOW_ATTRIBUTE(reset_status); DEFINE_SHOW_ATTRIBUTE(metrics_file); DEFINE_SHOW_ATTRIBUTE(metrics_latency); DEFINE_SHOW_ATTRIBUTE(metrics_size); 
 DEFINE_SHOW_ATTRIBUTE(metrics_caps);
 
+static const struct file_operations ceph_reset_trigger_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.write = reset_trigger_write,
+	.llseek = noop_llseek,
+};
 
 /*
  * debugfs
@@ -404,6 +496,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 	debugfs_remove(fsc->debugfs_caps);
 	debugfs_remove(fsc->debugfs_status);
 	debugfs_remove(fsc->debugfs_mdsc);
+	debugfs_remove_recursive(fsc->debugfs_reset_dir);
 	debugfs_remove_recursive(fsc->debugfs_metrics_dir);
 	doutc(fsc->client, "done\n");
 }
@@ -451,6 +544,17 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 						fsc,
 						&caps_fops);
 
+	fsc->debugfs_reset_dir = debugfs_create_dir("reset",
+						    fsc->client->debugfs_dir);
+	fsc->debugfs_reset_trigger =
+		debugfs_create_file("trigger", 0200,
+				    fsc->debugfs_reset_dir, fsc,
+				    &ceph_reset_trigger_fops);
+	fsc->debugfs_reset_status =
+		debugfs_create_file("status", 0400,
+				    fsc->debugfs_reset_dir, fsc,
+				    &reset_status_fops);
+
 	fsc->debugfs_status = debugfs_create_file("status",
 						  0400,
 						  fsc->client->debugfs_dir,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 7e399b0dcc55..98a882cf8b65 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5213,6 +5213,7 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 	blocked_count = atomic_inc_return(&st->blocked_requests);
 	doutc(cl, "request blocked during reset, %d total blocked\n",
 	      blocked_count);
+	trace_ceph_client_reset_blocked(mdsc, blocked_count);
 
 retry:
 	wait_ret = wait_event_interruptible_timeout(st->blocked_wq,
@@ -5223,10 +5224,12 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 	if (wait_ret == 0) {
 		atomic_dec(&st->blocked_requests);
 		pr_warn_client(cl, "timed out waiting for reset to complete\n");
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	if (wait_ret < 0) {
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, (int)wait_ret);
 		return (int)wait_ret; /* -ERESTARTSYS */
 	}
 
@@ -5241,12 +5244,14 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 		if (time_before(jiffies, deadline))
 			goto retry;
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	ret = st->last_errno;
 	spin_unlock(&st->lock);
 
 	atomic_dec(&st->blocked_requests);
+	trace_ceph_client_reset_unblocked(mdsc, ret);
 	return ret;
 }
 
@@ -5275,6 +5280,8 @@ static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
 
 	/* Wake up all requests that were blocked waiting for reset */
 	wake_up_all(&st->blocked_wq);
+
+	trace_ceph_client_reset_complete(mdsc, ret);
 }
 
 static void ceph_mdsc_reset_workfn(struct work_struct *work)
@@ -5559,6 +5566,7 @@ int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
 	pr_info_client(mdsc->fsc->client,
 		       "manual session reset scheduled (reason=\"%s\")\n",
 		       msg);
+	trace_ceph_client_reset_schedule(mdsc, msg);
 	return 0;
 }
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 1f901b1647e6..98af0a823c81 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -179,6 +179,9 @@ struct ceph_fs_client {
 	struct dentry *debugfs_status;
 	struct dentry *debugfs_mds_sessions;
 	struct dentry *debugfs_metrics_dir;
+	struct dentry *debugfs_reset_dir;
+	struct dentry *debugfs_reset_trigger;
+	struct dentry *debugfs_reset_status;
 #endif
 
 #ifdef CONFIG_CEPH_FSCACHE
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 08cb0659fbfc..e853c891ef71 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -226,6 +226,69 @@ TRACE_EVENT(ceph_handle_caps,
 		  __entry->mseq)
 );
 
+/*
+ * Client reset tracepoints - identify the client by its monitor-
+ * assigned global_id so traces remain meaningful when kernel pointer
+ * hashing is enabled.
+ */
+TRACE_EVENT(ceph_client_reset_schedule,
+	TP_PROTO(const struct ceph_mds_client *mdsc, const char *reason),
+	TP_ARGS(mdsc, reason),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__string(reason, reason ? reason : "")
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth->global_id;
+		__assign_str(reason);
+	),
+	TP_printk("client_id=%llu reason=%s",
+		  __entry->client_id, __get_str(reason))
+);
+
+TRACE_EVENT(ceph_client_reset_complete,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth->global_id;
+		__entry->ret = ret;
+	),
+	TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
+TRACE_EVENT(ceph_client_reset_blocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int blocked_count),
+	TP_ARGS(mdsc, blocked_count),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, blocked_count)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth->global_id;
+		__entry->blocked_count = blocked_count;
+	),
+	TP_printk("client_id=%llu blocked_count=%d", __entry->client_id,
+		  __entry->blocked_count)
+);
+
+TRACE_EVENT(ceph_client_reset_unblocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth->global_id;
+		__entry->ret = ret;
+	),
+	TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
 #undef EM
 #undef E_
 #endif /* _TRACE_CEPH_H */
-- 
2.34.1

From nobody Sat Jun 20 11:50:30 2026
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org
	(Postfix) with ESMTPS id 0226D3E5576 for ; Wed, 15 Apr 2026 17:01:28 +0000 (UTC)
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v2 7/7] ceph: add manual reset selftests and validation harness
Date: Wed, 15 Apr 2026 17:00:43 +0000
Message-Id: <20260415170043.3882912-8-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260415170043.3882912-1-amarkuze@redhat.com>
References: <20260415170043.3882912-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add single-client selftests and a validation wrapper for manual client
reset. The test set covers reset stress under concurrent metadata
activity together with targeted corner cases for overlap, dirty-state
handling, stale lock behavior, and unmount while reset is active.

A validation wrapper runs the individual stages with watchdog timeouts
and captures the final reset status for post-run checks. The stress
validator checks failure_count in addition to last_errno so that
transient mid-run reset failures are caught even when a later reset
succeeds.

Keep the test scope intentionally focused on the shipped single-client
reset behavior so the series includes a practical regression signal for
the final design.
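
For illustration only: the check on failure_count in addition to
last_errno described above could be sketched as below. This is not the
code in validate_consistency.py; the helper names parse_reset_status and
reset_run_is_clean are hypothetical, and the sample text is made up. The
only thing taken from the series is the "key: value" layout and the
field names printed by the reset status debugfs file.

```python
def parse_reset_status(text):
    """Parse 'key: value' lines as printed by the reset status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # ignore lines that do not have the 'key: value' shape
            fields[key.strip()] = value.strip()
    return fields


def reset_run_is_clean(fields):
    """Hypothetical check: True only if no reset failed during the run."""
    failures = int(fields.get("failure_count", "0"))
    last_errno = int(fields.get("last_errno", "0"))
    # last_errno alone reflects only the most recent reset, so a
    # mid-run failure followed by a success would be missed without
    # also looking at failure_count.
    return failures == 0 and last_errno == 0


# Made-up sample: one reset failed mid-run, the last one succeeded.
sample = """phase: idle
trigger_count: 12
success_count: 11
failure_count: 1
last_errno: 0
"""

print(reset_run_is_clean(parse_reset_status(sample)))  # prints False
```

With last_errno checked on its own, the sample above would pass; adding
failure_count is what flags it.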
Signed-off-by: Alex Markuze
---
 MAINTAINERS                                   |   1 +
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/filesystems/ceph/Makefile       |   7 +
 .../selftests/filesystems/ceph/README.md      |  84 +++
 .../filesystems/ceph/reset_corner_cases.sh    | 646 ++++++++++++++++
 .../filesystems/ceph/reset_stress.sh          | 694 ++++++++++++++++++
 .../filesystems/ceph/run_validation.sh        | 350 +++++++++
 .../selftests/filesystems/ceph/settings       |   1 +
 .../filesystems/ceph/validate_consistency.py  | 297 ++++++++
 9 files changed, 2081 insertions(+)
 create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
 create mode 100644 tools/testing/selftests/filesystems/ceph/README.md
 create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
 create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
 create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
 create mode 100644 tools/testing/selftests/filesystems/ceph/settings
 create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py

diff --git a/MAINTAINERS b/MAINTAINERS
index d1cc0e12fe1f..87c36a26c1f2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5917,6 +5917,7 @@ B:	https://tracker.ceph.com/
 T:	git https://github.com/ceph/ceph-client.git
 F:	Documentation/filesystems/ceph.rst
 F:	fs/ceph/
+F:	tools/testing/selftests/filesystems/ceph/
 
 CERTIFICATE HANDLING
 M:	David Howells
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 450f13ba4cca..81c01a7062e0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -32,6 +32,7 @@ TARGETS += exec
 TARGETS += fchmodat2
 TARGETS += filesystems
 TARGETS += filesystems/binderfs
+TARGETS += filesystems/ceph
 TARGETS += filesystems/epoll
 TARGETS += filesystems/fat
 TARGETS += filesystems/overlayfs
diff --git a/tools/testing/selftests/filesystems/ceph/Makefile b/tools/testing/selftests/filesystems/ceph/Makefile
new file mode 100644
index 000000000000..3ad768bc8420
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_PROGS := run_validation.sh
+TEST_FILES := reset_stress.sh reset_corner_cases.sh \
+	validate_consistency.py README.md settings
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/filesystems/ceph/README.md b/tools/testing/selftests/filesystems/ceph/README.md
new file mode 100644
index 000000000000..47931edf52b0
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/README.md
@@ -0,0 +1,84 @@
+# CephFS Client Reset Test Suite
+
+Test suite for the CephFS kernel client manual session reset feature.
+This trimmed set contains the single-client stress test, the targeted
+corner-case test, and the one-shot validation harness used during
+feature bring-up.
+
+## Prerequisites
+
+- Linux kernel with the CephFS client reset feature (this branch)
+- A running Ceph cluster with at least one MDS
+- Root access (debugfs requires it)
+- Python 3 (for validators)
+- flock utility (for lock tests, usually in util-linux)
+
+## Test inventory
+
+| Test | Script(s) | What it covers |
+|------|-----------|----------------|
+| Single-client stress | `reset_stress.sh` | I/O + resets + data integrity on one mount |
+| Corner cases | `reset_corner_cases.sh` | EBUSY, dirty caps, flock reclaim, unmount-during-reset |
+| Validation harness | `run_validation.sh` | baseline + corner cases + moderate/aggressive stress + final status check |
+
+## Quick start
+
+Stress run:
+
+    sudo ./reset_stress.sh --mount-point /mnt/cephfs --profile moderate
+
+Corner cases:
+
+    sudo ./reset_corner_cases.sh --mount-point /mnt/cephfs
+
+End-to-end validation:
+
+    sudo ./run_validation.sh --mount-point /mnt/cephfs
+
+## Stress profiles
+
+    baseline   - no resets, 1 IO + 1 rename, 600s
+    moderate   - reset every 5-15s, 2 IO + 1 rename, 900s
+    aggressive - reset every 1-5s, 4 IO + 2 rename, 900s
+    soak       - reset every 5-15s, 2 IO + 1 rename, 3600s
+
+## Key options (all scripts)
+
+    --mount-point PATH    CephFS mount point (required)
+    --client-id ID        Debugfs client id (auto-detected if one)
+
+reset_stress.sh additionally accepts:
+
+    --profile NAME        baseline|moderate|aggressive|soak
+    --duration-sec N      Override profile runtime
+    --no-reset            Disable reset injection
+    --out-dir PATH        Artifact directory
+
+## Corner case tests
+
+    [1/4] ebusy_rejection       Second reset rejected while first in-flight
+    [2/4] dirty_caps_at_reset   Reset with unflushed dirty caps
+    [3/4] flock_after_reset     Stale lock EIO + fresh lock after holder exit
+    [4/4] unmount_during_reset  umount during active reset (ESHUTDOWN path)
+
+Test 4 requires creating a second CephFS mount instance and SKIPs if
+the host cannot do so. See `--help` output for details.
+
+## Troubleshooting
+
+**No writable Ceph reset interface found:**
+Kernel lacks the reset feature, debugfs not mounted, or not root.
+Check: `ls /sys/kernel/debug/ceph/*/reset/`
+
+**Multiple Ceph clients found:**
+Use `--client-id` to select one.
+List: `ls /sys/kernel/debug/ceph/`
+
+## Files
+
+| File | Role |
+|------|------|
+| `reset_stress.sh` | Single-client stress test runner |
+| `validate_consistency.py` | Single-client post-run validator |
+| `reset_corner_cases.sh` | Corner case harness (4 sequential tests) |
+| `run_validation.sh` | One-shot validation harness |
diff --git a/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
new file mode 100755
index 000000000000..a6dae84a616d
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
@@ -0,0 +1,646 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset corner case tests.
+# Runs a checklist of targeted tests that exercise specific reset
+# code paths not covered by the stress tests.
+#
+# Requires: mounted CephFS, debugfs access (root), flock(1) utility.
+
+set -uo pipefail
+
+KSFT_SKIP=4
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+	MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "SKIP: No CephFS mount found and --mount-point not specified"
+		exit "$KSFT_SKIP"
+	fi
+	exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=""
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+DEBUGFS_CLIENT=""
+TRIGGER_PATH=""
+STATUS_PATH=""
+TEMP_MNT=""
+
+PASS_COUNT=0
+FAIL_COUNT=0
+SKIP_COUNT=0
+TOTAL=4
+
+log()
+{
+	printf '[%s] %s\n' "$(date -u +%H:%M:%S)" "$1"
+}
+
+result()
+{
+	local num="$1"
+	local name="$2"
+	local status="$3"
+	local detail="${4:-}"
+
+	case "$status" in
+	PASS) PASS_COUNT=$((PASS_COUNT + 1)) ;;
+	FAIL) FAIL_COUNT=$((FAIL_COUNT + 1)) ;;
+	SKIP) SKIP_COUNT=$((SKIP_COUNT + 1)) ;;
+	esac
+
+	if [[ -n "$detail" ]]; then
+		printf '[%d/%d] %-30s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$detail"
+	else
+		printf '[%d/%d] %-30s %s\n' "$num" "$TOTAL" "$name" "$status"
+	fi
+}
+
+read_status_field()
+{
+	local field="$1"
+	awk -F': ' -v key="$field" '$1 == key {print $2}' "$STATUS_PATH" 2>/dev/null
+}
+
+wait_reset_done()
+{
+	local timeout="${1:-30}"
+	local elapsed=0
+
+	while [[ "$(read_status_field "phase")" != "idle" ]]; do
+		sleep 1
+		elapsed=$((elapsed + 1))
+		if [[ "$elapsed" -ge "$timeout" ]]; then
+			return 1
+		fi
+	done
+	return 0
+}
+
+list_reset_clients()
+{
+	local entry
+
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		entry="$(basename "$entry")"
+		[[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+		[[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+		printf '%s\n' "$entry"
+	done
+}
+
+wait_status_nonidle()
+{
+	local status_path="$1"
+	local timeout="${2:-10}"
+	local polls=$((timeout * 10))
+	local phase
+
+	while [[ "$polls" -gt 0 ]]; do
+		phase="$(awk -F': ' '$1 == "phase" {print $2}' "$status_path" 2>/dev/null)"
+		if [[ -n "$phase" && "$phase" != "idle" ]]; then
+			return 0
+		fi
+		sleep 0.1
+		polls=$((polls - 1))
+	done
+
+	return 1
+}
+
+discover_debugfs()
+{
+	local candidates=()
+	local entry
+
+	if [[ -n "$DEBUGFS_CLIENT" ]]; then
+		if [[ ! -d "$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset" ]]; then
+			echo "SKIP: reset debugfs not found for $DEBUGFS_CLIENT" >&2
+			exit "$KSFT_SKIP"
+		fi
+		return 0
+	fi
+
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		entry="$(basename "$entry")"
+		[[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+		[[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+		candidates+=("$entry")
+	done
+
+	if [[ ${#candidates[@]} -eq 0 ]]; then
+		echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	if [[ ${#candidates[@]} -gt 1 ]]; then
+		echo "SKIP: Multiple Ceph clients found: ${candidates[*]}. Use --client-id." >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	DEBUGFS_CLIENT="${candidates[0]}"
+}
+
+# --- Test 1: ebusy_rejection ------------------------------------------------
+#
+# Trigger a reset while another is guaranteed in-flight. Creates
+# dirty state so the first reset enters DRAINING (which takes
+# measurable time), then polls until phase != idle and issues the
+# second trigger. The second trigger must fail (the kernel returns
+# -EBUSY), and only one reset must be counted in the accounting.
+
+test_ebusy_rejection()
+{
+	local num=1
+	local name="ebusy_rejection"
+	local testfile="$MOUNT_POINT/.reset_corner_ebusy_$$"
+	local tc_before tc_after sc_before sc_after second_rc phase elapsed
+
+	tc_before="$(read_status_field "trigger_count")"
+	sc_before="$(read_status_field "success_count")"
+
+	# Create dirty state so the first reset enters DRAINING
+	echo "ebusy_dirty_data" > "$testfile"
+	sync "$testfile"
+
+	python3 -c "
+import os, sys
+fd = os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_ebusy_test\n')
+sys.stdout.write('written')
+" 2>/dev/null || {
+		result "$num" "$name" FAIL "dirty write failed"
+		rm -f "$testfile"
+		return
+	}
+
+	# Trigger the first reset -- it will drain dirty state
+	echo "ebusy_first" > "$TRIGGER_PATH" 2>/dev/null || {
+		result "$num" "$name" FAIL "first trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	# Poll until phase is non-idle (quiescing or draining)
+	elapsed=0
+	while true; do
+		phase="$(read_status_field "phase")"
+		if [[ "$phase" != "idle" ]]; then
+			break
+		fi
+		sleep 0.1
+		elapsed=$((elapsed + 1))
+		if [[ "$elapsed" -ge 50 ]]; then
+			result "$num" "$name" SKIP \
+				"first reset completed before overlap could be tested"
+			rm -f "$testfile" 2>/dev/null
+			return
+		fi
+	done
+
+	# Issue the second trigger -- should be rejected with EBUSY
+	second_rc=0
+	echo "ebusy_second" > "$TRIGGER_PATH" 2>/dev/null && second_rc=0 || second_rc=$?
+
+	if ! wait_reset_done 30; then
+		result "$num" "$name" FAIL "first reset never completed"
+		rm -f "$testfile"
+		return
+	fi
+
+	tc_after="$(read_status_field "trigger_count")"
+	sc_after="$(read_status_field "success_count")"
+
+	if [[ "$((tc_after - tc_before))" -ne 1 ]]; then
+		result "$num" "$name" FAIL "trigger_count +$((tc_after - tc_before)), expected +1"
+		rm -f "$testfile"
+		return
+	fi
+
+	if [[ "$((sc_after - sc_before))" -ne 1 ]]; then
+		result "$num" "$name" FAIL "success_count +$((sc_after - sc_before)), expected +1"
+		rm -f "$testfile"
+		return
+	fi
+
+	if [[ "$second_rc" -eq 0 ]]; then
+		result "$num" "$name" FAIL "second trigger did not return error"
+		rm -f "$testfile"
+		return
+	fi
+
+	rm -f "$testfile" 2>/dev/null
+	result "$num" "$name" PASS
+}
+
+# --- Test 2: dirty_caps_at_reset --------------------------------------------
+#
+# Write to a file without fsync (dirty caps), trigger reset, then
+# verify the file is not corrupt. Manual reset drains dirty caps
+# before teardown (best-effort, 5s timeout). For a non-stuck cap
+# the dirty write should be flushed during drain and persist.
+# If the drain window is too short, only the synced first line
+# persists -- that is acceptable (data loss is documented for
+# unflushed writes).
+
+test_dirty_caps_at_reset()
+{
+	local num=2
+	local name="dirty_caps_at_reset"
+	local testfile="$MOUNT_POINT/.reset_corner_dirty_caps_$$"
+	local content_after line_count sc_before sc_after le
+
+	sc_before="$(read_status_field "success_count")"
+
+	echo "line_1_before_dirty_write" > "$testfile"
+	sync "$testfile"
+
+	python3 -c "
+import os, sys
+fd = os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'line_2_dirty_no_fsync\n')
+# deliberately no fsync -- leave caps dirty
+sys.stdout.write('written')
+" 2>/dev/null || {
+		result "$num" "$name" FAIL "dirty write failed"
+		rm -f "$testfile"
+		return
+	}
+
+	echo "dirty_caps_test" > "$TRIGGER_PATH" 2>/dev/null || {
+		result "$num" "$name" FAIL "reset trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	if ! wait_reset_done 30; then
+		result "$num" "$name" FAIL "reset did not complete"
+		rm -f "$testfile"
+		return
+	fi
+
+	sc_after="$(read_status_field "success_count")"
+	if [[ "$sc_after" -le "$sc_before" ]]; then
+		result "$num" "$name" FAIL "success_count did not increment (reset not exercised)"
+		rm -f "$testfile"
+		return
+	fi
+
+	sync "$testfile" 2>/dev/null || true
+	content_after="$(cat "$testfile" 2>/dev/null)" || {
+		result "$num" "$name" FAIL "cannot read file after reset"
+		rm -f "$testfile"
+		return
+	}
+
+	if [[ -z "$content_after" ]]; then
+		result "$num" "$name" FAIL "file is empty after reset"
+		rm -f "$testfile"
+		return
+	fi
+
+	line_count="$(echo "$content_after" | wc -l)"
+	if [[ "$line_count" -lt 1 ]]; then
+		result "$num" "$name" FAIL "file has $line_count lines, expected >= 1"
+		rm -f "$testfile"
+		return
+	fi
+
+	echo "$content_after" | head -1 | grep -q "line_1_before_dirty_write" || {
+		result "$num" "$name" FAIL "first line corrupted"
+		rm -f "$testfile"
+		return
+	}
+
+	le="$(read_status_field "last_errno")"
+	if [[ "$le" != "0" ]]; then
+		result "$num" "$name" FAIL "last_errno=$le, expected 0"
+		rm -f "$testfile"
+		return
+	fi
+
+	rm -f "$testfile"
+	result "$num" "$name" PASS "file intact ($line_count lines)"
+}
+
+# --- Test 3: flock_after_reset ----------------------------------------------
+#
+# Take an exclusive flock, trigger reset, verify stale lock state is
+# marked with CEPH_I_ERROR_FILELOCK (same-client flock attempt returns
+# EIO). After the original holder exits (releasing the local lock
+# reference and clearing the error flag), a fresh lock can be acquired.
+#
+# The lock holder uses the fd-based flock form with exec, so killing
+# $lock_pid closes the lock fd immediately (no orphaned child with an
+# inherited fd copy that would prevent the VFS flock release).
+
+test_flock_after_reset()
+{
+	local num=3
+	local name="flock_after_reset"
+	local testfile="$MOUNT_POINT/.reset_corner_flock_$$"
+	local lock_pid probe_rc sc_before sc_after
+
+	sc_before="$(read_status_field "success_count")"
+
+	echo "flock_test_content" > "$testfile"
+	sync "$testfile"
+
+	# Hold lock via fd in a subshell; exec ensures killing $lock_pid
+	# closes the lock fd directly (no fork/child fd inheritance).
+	(
+		exec 9<"$testfile"
+		flock --exclusive --nonblock 9 || exit 1
+		exec sleep 300
+	) &
+	lock_pid=$!
+	sleep 0.5
+
+	if ! kill -0 "$lock_pid" 2>/dev/null; then
+		result "$num" "$name" FAIL "flock holder died immediately"
+		rm -f "$testfile"
+		return
+	fi
+
+	echo "flock_after_reset_test" > "$TRIGGER_PATH" 2>/dev/null || {
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "reset trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	if ! wait_reset_done 30; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "reset did not complete"
+		rm -f "$testfile"
+		return
+	fi
+
+	sc_after="$(read_status_field "success_count")"
+	if [[ "$sc_after" -le "$sc_before" ]]; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "success_count did not increment"
+		rm -f "$testfile"
+		return
+	fi
+
+	# After teardown, CEPH_I_ERROR_FILELOCK is set on the inode.
+	# A same-client lock attempt should fail (EIO), NOT succeed.
+	probe_rc=0
+	flock --exclusive --nonblock "$testfile" true 2>/dev/null && probe_rc=0 || probe_rc=$?
+	if [[ "$probe_rc" -eq 0 ]]; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL \
+			"same-client probe succeeded, expected EIO from stale lock state"
+		rm -f "$testfile"
+		return
+	fi
+
+	# Kill the holder -- the exec'd sleep IS $lock_pid, so killing it
+	# closes fd 9 directly. VFS flock release fires ceph_fl_release_lock(),
+	# which decrements i_filelock_ref to 0 and clears CEPH_I_ERROR_FILELOCK.
+	kill "$lock_pid" 2>/dev/null
+	wait "$lock_pid" 2>/dev/null
+
+	# After the holder exits, a fresh lock should be acquirable.
+	# The reset teardown sends SESSION_REQUEST_CLOSE so the MDS
+	# releases locks promptly, but retry briefly in case the
+	# message races with the connection close.
+	local attempt
+	probe_rc=1
+	for attempt in 1 2 3 4 5; do
+		probe_rc=0
+		flock --exclusive --nonblock "$testfile" true 2>/dev/null \
+			&& probe_rc=0 || probe_rc=$?
+		[[ "$probe_rc" -eq 0 ]] && break
+		sleep 1
+	done
+	if [[ "$probe_rc" -ne 0 ]]; then
+		result "$num" "$name" FAIL \
+			"cannot acquire fresh lock after holder exit (rc=$probe_rc, ${attempt} attempts)"
+		rm -f "$testfile"
+		return
+	fi
+
+	# Verify file content survived
+	grep -q "flock_test_content" "$testfile" 2>/dev/null || {
+		result "$num" "$name" FAIL "file content corrupted after reset"
+		rm -f "$testfile"
+		return
+	}
+
+	rm -f "$testfile"
+	result "$num" "$name" PASS "stale lock detected, fresh lock acquired after holder exit"
+}
+
+# --- Test 4: unmount_during_reset -------------------------------------------
+#
+# Mount a fresh CephFS, trigger reset, immediately unmount. The
+# ceph_mdsc_destroy() path must wake blocked waiters with -ESHUTDOWN
+# and not hang.
+
+test_unmount_during_reset()
+{
+	local num=4
+	local name="unmount_during_reset"
+	local temp_mnt="/tmp/ceph_corner_mnt_$$"
+	local mount_opts=""
+	local mount_src=""
+	local temp_trigger=""
+	local temp_status=""
+	local temp_client=""
+	local temp_file="$temp_mnt/.reset_corner_umount_$$"
+	local phase=""
+	local trigger_ok=0
+	local attempt
+	local -a new_clients=()
+	declare -A existing_clients=()
+
+	mount_src="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $1; exit}' /proc/mounts 2>/dev/null)"
+	mount_opts="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $4; exit}' /proc/mounts 2>/dev/null)"
+
+	if [[ -z "$mount_src" ]]; then
+		result "$num" "$name" SKIP "cannot determine mount source from /proc/mounts"
+		return
+	fi
+
+	while IFS= read -r existing; do
+		[[ -n "$existing" ]] || continue
+		existing_clients["$existing"]=1
+	done < <(list_reset_clients)
+
+	mkdir -p "$temp_mnt"
+
+	if ! mount -t ceph "$mount_src" "$temp_mnt" -o "$mount_opts" 2>/dev/null; then
+		result "$num" "$name" SKIP "cannot mount additional CephFS instance"
+		rmdir "$temp_mnt" 2>/dev/null
+		return
+	fi
+
+	ls "$temp_mnt" > /dev/null 2>&1
+	sync
+	sleep 1
+
+	for attempt in $(seq 1 50); do
+		new_clients=()
+		while IFS= read -r entry; do
+			[[ -n "$entry" ]] || continue
+			if [[ -n "${existing_clients[$entry]+x}" ]]; then
+				continue
+			fi
+			new_clients+=("$entry")
+		done < <(list_reset_clients)
+
+		if [[ "${#new_clients[@]}" -eq 1 ]]; then
+			temp_client="${new_clients[0]}"
+			break
+		fi
+
+		if [[ "${#new_clients[@]}" -gt 1 ]]; then
+			break
+		fi
+
+		sleep 0.1
+	done
+
+	if [[ -z "$temp_client" ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" SKIP "cannot identify debugfs client for temp mount"
+		return
+	fi
+
+	if [[ "${#new_clients[@]}" -gt 1 ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" SKIP "multiple new debugfs clients appeared"
+		return
+	fi
+
+	temp_trigger="$DEBUGFS_ROOT/$temp_client/reset/trigger"
+	temp_status="$DEBUGFS_ROOT/$temp_client/reset/status"
+
+	echo "umount_dirty_seed" > "$temp_file" 2>/dev/null || {
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot create dirty state on temp mount"
+		return
+	}
+	sync "$temp_file"
+	python3 -c "
+import os, sys
+fd = os.open('$temp_file', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_umount_test\\n')
+os.close(fd)
+" 2>/dev/null || {
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot dirty temp mount for reset overlap"
+		return
+	}
+
+	echo "unmount_test" > "$temp_trigger" 2>/dev/null && trigger_ok=1 || trigger_ok=0
+	if [[ "$trigger_ok" -ne 1 ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot trigger reset on temp mount"
+		return
+	fi
+
+	if ! wait_status_nonidle "$temp_status" 10; then
+		phase="$(awk -F': ' '$1 == "phase" {print $2}' "$temp_status" 2>/dev/null)"
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL \
+			"reset never became active before umount (phase=${phase:-unknown})"
+		return
+	fi
+
+	local umount_ok=0
+	timeout 30 umount "$temp_mnt" 2>/dev/null && umount_ok=1
+
+	if [[ "$umount_ok" -ne 1 ]]; then
+		umount -l "$temp_mnt" 2>/dev/null || true
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "umount hung for >30s"
+		return
+	fi
+
+	rmdir "$temp_mnt" 2>/dev/null
+
+	ls "$MOUNT_POINT" > /dev/null 2>&1 || {
+		result "$num" "$name" FAIL "original mount unhealthy after test"
+		return
+	}
+
+	result "$num" "$name" PASS
+}
+
+# --- Main --------------------------------------------------------------------
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point PATH [--client-id ID] [--debugfs-root PATH]
+
+Runs targeted corner-case tests for the CephFS client reset feature.
+Requires root (debugfs access) and a mounted CephFS filesystem.
+
+Options:
+  --mount-point PATH     CephFS mount point (required)
+  --client-id ID         Ceph debugfs client id (auto-detect if one client)
+  --debugfs-root PATH    Debugfs ceph root (default: /sys/kernel/debug/ceph)
+  --help                 Show this message
+EOF
+}
+
+main()
+{
+	while [[ $# -gt 0 ]]; do
+		case "$1" in
+		--mount-point) MOUNT_POINT="$2"; shift 2 ;;
+		--client-id) DEBUGFS_CLIENT="$2"; shift 2 ;;
+		--debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+		--help|-h) usage; exit 0 ;;
+		*) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+		esac
+	done
+
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "--mount-point is required" >&2
+		usage
+		exit 2
+	fi
+
+	if [[ ! -d "$MOUNT_POINT" ]]; then
+		echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	discover_debugfs
+	TRIGGER_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/trigger"
+	STATUS_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/status"
+
+	log "CephFS client reset corner case tests"
+	log "Mount: $MOUNT_POINT"
+	log "Client: $DEBUGFS_CLIENT"
+	echo ""
+
+	test_ebusy_rejection
+	test_dirty_caps_at_reset
+	test_flock_after_reset
+	test_unmount_during_reset
+
+	echo ""
+	echo "Results: $PASS_COUNT passed, $FAIL_COUNT failed, $SKIP_COUNT skipped (of $TOTAL)"
+
+	if [[ "$FAIL_COUNT" -gt 0 ]]; then
+		exit 1
+	fi
+	exit 0
+}
+
+main "$@"
diff --git a/tools/testing/selftests/filesystems/ceph/reset_stress.sh b/tools/testing/selftests/filesystems/ceph/reset_stress.sh
new file mode 100755
index 000000000000..c503c75a5f7a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_stress.sh
@@ -0,0 +1,694 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS reset stress test:
+# - Runs concurrent I/O and rename workloads
+# - Triggers random client resets through debugfs
+# - Validates consistency and recovery behavior
+
+set -euo pipefail
+
+KSFT_SKIP=4
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+	MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "SKIP: No CephFS mount found and --mount-point not specified"
+		exit "$KSFT_SKIP"
+	fi
+	exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+PROFILE="moderate"
+DURATION_SEC=""
+COOLDOWN_SEC=20
+FILE_COUNT=64
+IO_WORKERS=""
+RENAME_WORKERS=""
+MOUNT_POINT=""
+OUT_DIR=""
+CLIENT_ID=""
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+SLO_SECONDS=30
+EXPECT_RESET=1
+DMESG_CMD=""
+SUDO=""
+
+RESET_MIN_SEC=5
+RESET_MAX_SEC=15
+
+RUN_ID="$(date +%Y%m%d-%H%M%S)"
+WORKLOAD_FLAG=""
+RESET_FLAG=""
+DATA_DIR=""
+
+IO_LOG=""
+RENAME_LOG=""
+RESET_LOG=""
+STATUS_LOG=""
+STATUS_BEFORE=""
+STATUS_FINAL=""
+DMESG_LOG=""
+SUMMARY_LOG=""
+REPORT_JSON=""
+
+RESET_PID=0
+STATUS_PID=0
+declare -a IO_WORKER_PIDS=()
+declare -a RENAME_WORKER_PIDS=()
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point <path> [options]
+
+Required:
+  --mount-point PATH      CephFS mount point to test under
+
+Options:
+  --profile NAME          baseline|moderate|aggressive|soak (default: moderate)
+  --duration-sec N        Override profile runtime in seconds
+  --cooldown-sec N        Workload drain time after injector stop (default: 20)
+  --file-count N          Number of logical files (default: 64)
+  --io-workers N          Number of concurrent I/O workers (profile default)
+  --rename-workers N      Number of concurrent rename workers (profile default)
+  --out-dir PATH          Artifact directory (default: /tmp/ceph_reset_stress_)
+  --client-id ID          Ceph debugfs client id; auto-detect if one client exists
+  --debugfs-root PATH     Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+  --slo-seconds N         Max allowed post-reset stall window (default: 30)
+  --no-reset              Disable reset injector (baseline mode helper)
+  --help                  Show this message
+
+Examples:
+  $0 --mount-point /mnt/cephfs --profile moderate
+  $0 --mount-point /mnt/cephfs --profile aggressive --duration-sec 300
+  $0 --mount-point /mnt/cephfs --profile baseline --no-reset
+EOF
+}
+
+now_ms()
+{
+	date +%s%3N
+}
+
+set_profile_defaults()
+{
+	case "$PROFILE" in
+	baseline)
+		RESET_MIN_SEC=0
+		RESET_MAX_SEC=0
+		EXPECT_RESET=0
+		: "${DURATION_SEC:=600}"
+		: "${IO_WORKERS:=1}"
+		: "${RENAME_WORKERS:=1}"
+		;;
+	moderate)
+		RESET_MIN_SEC=5
+		RESET_MAX_SEC=15
+		: "${DURATION_SEC:=900}"
+		: "${IO_WORKERS:=2}"
+		: "${RENAME_WORKERS:=1}"
+		;;
+	aggressive)
+		RESET_MIN_SEC=1
+		RESET_MAX_SEC=5
+		: "${DURATION_SEC:=900}"
+		: "${IO_WORKERS:=4}"
+		: "${RENAME_WORKERS:=2}"
+		;;
+	soak)
+		RESET_MIN_SEC=5
+		RESET_MAX_SEC=15
+		: "${DURATION_SEC:=3600}"
+		: "${IO_WORKERS:=2}"
+		: "${RENAME_WORKERS:=1}"
+		;;
+	*)
+		echo "Unknown profile: $PROFILE" >&2
+		exit 2
+		;;
+	esac
+}
+
+log_summary()
+{
+	local msg="$1"
+	printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$msg" | tee -a "$SUMMARY_LOG"
+}
+
+discover_client_id()
+{
+	local candidates=()
+	local entry
+
+	if [[ -n "$CLIENT_ID" ]]; then
+		if ! $SUDO test -d "$DEBUGFS_ROOT/$CLIENT_ID/reset"; then
+			echo "SKIP: reset debugfs not found for client-id=$CLIENT_ID" >&2
+			exit "$KSFT_SKIP"
+		fi
+		return 0
+	fi
+
+	if ! $SUDO test -d "$DEBUGFS_ROOT"; then
+		echo "SKIP: Debugfs root not found: $DEBUGFS_ROOT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	while IFS= read -r entry; do
+		$SUDO test -d "$DEBUGFS_ROOT/$entry/reset" || continue
+		$SUDO test -w "$DEBUGFS_ROOT/$entry/reset/trigger" || continue
+		candidates+=("$entry")
+	done < <($SUDO ls -1 "$DEBUGFS_ROOT" 2>/dev/null || true)
+
+	if [[ ${#candidates[@]} -eq 1 ]]; then
+		CLIENT_ID="${candidates[0]}"
+		return 0
+	fi
+
+	if [[ ${#candidates[@]} -eq 0 ]]; then
+		echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-id." >&2
+	exit "$KSFT_SKIP"
+}
+
+init_dataset()
+{
+	local i
+	mkdir -p "$DATA_DIR/A" "$DATA_DIR/B"
+
+	for ((i = 0; i < FILE_COUNT; i++)); do
+		printf 'seed logical_id=%05d ts_ms=%s\n' "$i" "$(now_ms)" > "$DATA_DIR/A/file_$(printf '%05d' "$i")"
+	done
+}
+
+io_worker()
+{
+	set +e
+	local worker_id="$1"
+	local seq=0
+	local id
+	local relpath
+	local abspath
+	local payload
+	local hash
+	local ts
+
+	while [[ -f "$WORKLOAD_FLAG" ]]; do
+		id="$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+		if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+			relpath="A/file_$id"
+		elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+			relpath="B/file_$id"
+		else
+			sleep 0.02
+			continue
+		fi
+
+		abspath="$DATA_DIR/$relpath"
+		alt_relpath=""
+		if [[ "$relpath" == A/* ]]; then
+			alt_relpath="B/file_$id"
+		else
+			alt_relpath="A/file_$id"
+		fi
+		alt_abspath="$DATA_DIR/$alt_relpath"
+		payload="worker=${worker_id} io_seq=${seq} id=${id} ts_ms=$(now_ms)"
+		result="$(
+			python3 - "$abspath" "$alt_abspath" "$payload" <<'PY'
+import hashlib
+import os
+import sys
+
+path = sys.argv[1]
+alt_path = sys.argv[2]
+payload = sys.argv[3]
+
+try:
+    fd = os.open(path, os.O_RDWR | os.O_APPEND)
+    actual = path
+except FileNotFoundError:
+    try:
+        fd = os.open(alt_path, os.O_RDWR | os.O_APPEND)
+        actual = alt_path
+    except FileNotFoundError:
+        sys.exit(1)
+
+try:
+    os.write(fd, (payload + "\n").encode())
+    os.fsync(fd)
+    os.lseek(fd, 0, os.SEEK_SET)
+    digest = hashlib.sha256()
+    while True:
+        chunk = os.read(fd, 1 << 20)
+        if not chunk:
+            break
+        digest.update(chunk)
+    print(actual + " " + digest.hexdigest())
+finally:
+    os.close(fd)
+PY
+		)" || {
+			sleep 0.02
+			continue
+		}
+
+		actual_abspath="${result%% *}"
+		hash="${result#* }"
+		if [[ "$actual_abspath" == "$alt_abspath" ]]; then
+			relpath="$alt_relpath"
+		fi
+
+		ts="$(now_ms)"
+		printf '%s,%s,%s,%s,%s\n' "$ts" "$seq" "$id" "$relpath" "$hash" >> "$IO_LOG"
+		seq=$((seq + 1))
+		sleep 0.02
+	done
+}
+
+rename_worker()
+{
+	set +e
+	local worker_id="$1"
+	local seq=0
+	local id
+	local src_rel
+	local dst_rel
+	local rc
+	local ts
+
+	while [[ -f "$WORKLOAD_FLAG" ]]; do
+		id="$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+
+		if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+			src_rel="A/file_$id"
+			dst_rel="B/file_$id"
+		elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+			src_rel="B/file_$id"
+			dst_rel="A/file_$id"
+		else
+			sleep 0.02
+			continue
+		fi
+
+		ts="$(now_ms)"
+		if mv -T "$DATA_DIR/$src_rel" "$DATA_DIR/$dst_rel" 2>/dev/null; then
+			rc=0
+		else
+			rc=$?
+		fi
+		printf '%s,%s,%s,%s,%s,%s,%s\n' "$ts" "$worker_id" "$seq" "$id" "$src_rel" "$dst_rel" "$rc" >> "$RENAME_LOG"
+		seq=$((seq + 1))
+		sleep 0.02
+	done
+}
+
+random_sleep_seconds()
+{
+	local min_sec="$1"
+	local max_sec="$2"
+	local wait_sec
+	local span
+
+	span=$((max_sec - min_sec + 1))
+	wait_sec=$((min_sec + RANDOM % span))
+	sleep "$wait_sec"
+}
+
+reset_injector()
+{
+	set +e
+	local trigger_path="$1"
+	local seq=0
+	local ts
+	local reason
+	local rc
+
+	while [[ -f "$RESET_FLAG" ]]; do
+		random_sleep_seconds "$RESET_MIN_SEC" "$RESET_MAX_SEC"
+		[[ -f "$RESET_FLAG" ]] || break
+
+		ts="$(now_ms)"
+		reason="stress_${seq}_${ts}"
+		if echo "$reason" | $SUDO tee "$trigger_path" > /dev/null 2>&1; then
+			rc=0
+		else
+			rc=$?
+		fi
+		printf '%s,%s,%s,%s\n' "$ts" "$seq" "$reason" "$rc" >> "$RESET_LOG"
+		seq=$((seq + 1))
+	done
+}
+
+status_sampler()
+{
+	set +e
+	local status_path="$1"
+	local ts
+	local kv_line
+
+	while [[ -f "$WORKLOAD_FLAG" || -f "$RESET_FLAG" ]]; do
+		ts="$(now_ms)"
+		if $SUDO test -r "$status_path"; then
+			kv_line="$($SUDO awk -F': ' 'NF>=2 {gsub(/ /, "", $1); gsub(/ /, "", $2); printf "%s=%s;", $1, $2}' "$status_path")"
+			printf '%s,%s\n' "$ts" "$kv_line" >> "$STATUS_LOG"
+		fi
+		sleep 1
+	done
+}
+
+stop_pid_with_timeout()
+{
+	local pid="$1"
+	local name="$2"
+	local timeout="$3"
+	local waited=0
+
+	if [[ "$pid" -le 0 ]]; then
+		return 0
+	fi
+
+	while kill -0 "$pid" 2>/dev/null; do
+		if (( waited >= timeout )); then
+			log_summary "Timeout waiting for $name (pid=$pid), sending SIGTERM/SIGKILL"
+			kill -TERM "$pid" 2>/dev/null || true
+			sleep 1
+			kill -KILL "$pid" 2>/dev/null || true
+			wait "$pid" 2>/dev/null || true
+			return 1
+		fi
+		sleep 1
+		waited=$((waited + 1))
+	done
+
+	wait "$pid" 2>/dev/null || true
+	return 0
+}
+
+detect_privileges()
+{
+	if [[ -r "$DEBUGFS_ROOT" ]]; then
+		SUDO=""
+	elif sudo -n true 2>/dev/null; then
+		SUDO="sudo"
+	else
+		echo "WARNING: $DEBUGFS_ROOT is not readable and passwordless sudo is not available" >&2
+		echo "WARNING: reset injection, debugfs status checks, and dmesg capture will not work" >&2
+	fi
+
+	if $SUDO dmesg > /dev/null 2>&1; then
+		DMESG_CMD="$SUDO dmesg"
+	else
+		DMESG_CMD=""
+		echo "WARNING: dmesg is not accessible; kernel errors (hung tasks) will not be detected" >&2
+	fi
+}
+
+check_dmesg()
+{
+	local start_epoch="$1"
+
+	if [[ -z "$DMESG_CMD" ]]; then
+		return 0
+	fi
+
+	if ! $DMESG_CMD --since "@$start_epoch" > "$DMESG_LOG" 2>/dev/null; then
+		if ! $DMESG_CMD > "$DMESG_LOG" 2>/dev/null; then
+			log_summary "WARNING: dmesg capture failed unexpectedly"
+			return 0
+		fi
+		log_summary "dmesg --since unsupported; captured full dmesg"
+	fi
+
+	if grep -qi "hung task" "$DMESG_LOG" 2>/dev/null; then
+		log_summary "ERROR: kernel log contains 'hung task' during test window"
+		return 1
+	fi
+
+	return 0
+}
+
+cleanup()
+{
+	rm -f "$WORKLOAD_FLAG" "$RESET_FLAG"
+	local pid
+	for pid in "${IO_WORKER_PIDS[@]}" "${RENAME_WORKER_PIDS[@]}" "$RESET_PID" "$STATUS_PID"; do
+		[[ "$pid" -gt 0 ]] 2>/dev/null && kill "$pid" 2>/dev/null || true
+	done
+	wait 2>/dev/null || true
+}
+
+parse_args()
+{
+	while [[ $# -gt 0 ]]; do
+		case "$1" in
+		--mount-point)
+			MOUNT_POINT="$2"
+			shift 2
+			;;
+		--profile)
+			PROFILE="$2"
+			shift 2
+			;;
+		--duration-sec)
+			DURATION_SEC="$2"
+			shift 2
+			;;
+		--cooldown-sec)
+			COOLDOWN_SEC="$2"
+			shift 2
+			;;
+		--file-count)
+			FILE_COUNT="$2"
+			shift 2
+			;;
+		--io-workers)
+			IO_WORKERS="$2"
+			shift 2
+			;;
+		--rename-workers)
+			RENAME_WORKERS="$2"
+			shift 2
+			;;
+		--out-dir)
+			OUT_DIR="$2"
+			shift 2
+			;;
+		--client-id)
+			CLIENT_ID="$2"
+			shift 2
+			;;
+		--debugfs-root)
+			DEBUGFS_ROOT="$2"
+			shift 2
+			;;
+		--slo-seconds)
+			SLO_SECONDS="$2"
+			shift 2
+			;;
+		--no-reset)
+			EXPECT_RESET=0
+			shift
+			;;
+		--help|-h)
+			usage
+			exit 0
+			;;
+		*)
+			echo "Unknown option: $1" >&2
+			usage
+			exit 2
+			;;
+		esac
+	done
+}
+
+main()
+{
+	local start_epoch
+	local trigger_path=""
+	local status_path=""
+	local final_rc=0
+	local reset_enabled=0
+	local i
+
+	parse_args "$@"
+
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "--mount-point is required" >&2
+		usage
+		exit 2
+	fi
+
+	if [[ ! -d "$MOUNT_POINT" ]]; then
+		echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	if ! touch "$MOUNT_POINT/.ceph_reset_test_probe" 2>/dev/null; then
+		echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+		exit "$KSFT_SKIP"
+	fi
+	rm -f "$MOUNT_POINT/.ceph_reset_test_probe"
+
+	if ! command -v python3 > /dev/null 2>&1; then
+		echo "SKIP: python3 is required but not found in PATH" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	if ! stat -f -c '%T' "$MOUNT_POINT" 2>/dev/null | grep -qi ceph; then
+		echo "WARNING: $MOUNT_POINT does not appear to be a CephFS mount" >&2
+	fi
+
+	detect_privileges
+
+	set_profile_defaults
+	if [[ "$EXPECT_RESET" -eq 0 ]]; then
+		PROFILE="baseline"
+		RESET_MIN_SEC=0
+		RESET_MAX_SEC=0
+	fi
+
+	if ! [[ "$IO_WORKERS" =~ ^[0-9]+$ && "$RENAME_WORKERS" =~ ^[0-9]+$ ]]; then
+		echo "io-workers and rename-workers must be integers" >&2
+		exit 2
+	fi
+
+	if [[ "$IO_WORKERS" -le 0 || "$RENAME_WORKERS" -le 0 ]]; then
+		echo "io-workers and rename-workers must be > 0" >&2
+		exit 2
+	fi
+
+	if [[ -z "$OUT_DIR" ]]; then
+		OUT_DIR="/tmp/ceph_reset_stress_${RUN_ID}"
+	fi
+	mkdir -p "$OUT_DIR"
+
+	WORKLOAD_FLAG="$OUT_DIR/workload.running"
+	RESET_FLAG="$OUT_DIR/reset.running"
+
+	DATA_DIR="$MOUNT_POINT/ceph_reset_stress_${RUN_ID}"
+	mkdir -p "$DATA_DIR"
+
+	IO_LOG="$OUT_DIR/io.log"
+	RENAME_LOG="$OUT_DIR/rename.log"
+	RESET_LOG="$OUT_DIR/reset.log"
+	STATUS_LOG="$OUT_DIR/status.log"
+	STATUS_BEFORE="$OUT_DIR/reset_status.before"
+	STATUS_FINAL="$OUT_DIR/reset_status.final"
+	DMESG_LOG="$OUT_DIR/dmesg.log"
+	SUMMARY_LOG="$OUT_DIR/summary.log"
+	REPORT_JSON="$OUT_DIR/validator_report.json"
+
+	: > "$IO_LOG"
+	: > "$RENAME_LOG"
+	: > "$RESET_LOG"
+	: > "$STATUS_LOG"
+	: > "$SUMMARY_LOG"
+
+	start_epoch="$(date +%s)"
+
+	log_summary "Starting Ceph reset stress test"
+	log_summary "Profile=$PROFILE duration=${DURATION_SEC}s cooldown=${COOLDOWN_SEC}s file_count=${FILE_COUNT} io_workers=${IO_WORKERS} rename_workers=${RENAME_WORKERS}"
+	[[ -n "$SUDO" ]] && log_summary "Using sudo for privileged operations"
+	[[ -z "$DMESG_CMD" ]] && log_summary "WARNING: dmesg not available; hung task detection disabled"
+	log_summary "Artifacts=$OUT_DIR"
+	log_summary "Data dir=$DATA_DIR"
+
+	init_dataset
+
+	if [[ "$EXPECT_RESET" -eq 1 ]]; then
+		discover_client_id
+		trigger_path="$DEBUGFS_ROOT/$CLIENT_ID/reset/trigger"
+		status_path="$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+		if ! $SUDO test -w "$trigger_path"; then
+			echo "SKIP: Reset trigger is not writable: $trigger_path" >&2
+			exit "$KSFT_SKIP"
+		fi
+		if ! $SUDO test -r "$status_path"; then
+			echo "SKIP: Reset status is not readable: $status_path" >&2
+			exit "$KSFT_SKIP"
+		fi
+		$SUDO cat "$status_path" > "$STATUS_BEFORE" || true
+		reset_enabled=1
+		log_summary "Using ceph client id: $CLIENT_ID"
+	fi
+
+	trap cleanup EXIT INT TERM
+
+	touch "$WORKLOAD_FLAG"
+	for ((i = 0; i < IO_WORKERS; i++)); do
+		io_worker "$i" &
+		IO_WORKER_PIDS+=("$!")
+	done
+
+	for ((i = 0; i < RENAME_WORKERS; i++)); do
+		rename_worker "$i" &
+		RENAME_WORKER_PIDS+=("$!")
+	done
+
+	if [[ "$reset_enabled" -eq 1 ]]; then
+		touch "$RESET_FLAG"
+		reset_injector "$trigger_path" &
+		RESET_PID=$!
+
+		status_sampler "$status_path" &
+		STATUS_PID=$!
+	fi
+
+	sleep "$DURATION_SEC"
+
+	if [[ "$reset_enabled" -eq 1 ]]; then
+		rm -f "$RESET_FLAG"
+		stop_pid_with_timeout "$RESET_PID" "reset_injector" 20 || final_rc=1
+		log_summary "Injector stopped; entering cooldown=${COOLDOWN_SEC}s"
+	fi
+
+	sleep "$COOLDOWN_SEC"
+
+	rm -f "$WORKLOAD_FLAG"
+	for i in "${!IO_WORKER_PIDS[@]}"; do
+		stop_pid_with_timeout "${IO_WORKER_PIDS[$i]}" "io_worker[$i]" 20 || final_rc=1
+	done
+	for i in "${!RENAME_WORKER_PIDS[@]}"; do
+		stop_pid_with_timeout "${RENAME_WORKER_PIDS[$i]}" "rename_worker[$i]" 20 || final_rc=1
+	done
+
+	if [[ "$reset_enabled" -eq 1 ]]; then
+		stop_pid_with_timeout "$STATUS_PID" "status_sampler" 10 || final_rc=1
+		$SUDO cat "$status_path" > "$STATUS_FINAL" || true
+	fi
+
+	if ! check_dmesg "$start_epoch"; then
+		final_rc=1
+	fi
+
+	if ! python3 "$SCRIPT_DIR/validate_consistency.py" \
+		--data-dir "$DATA_DIR" \
+		--file-count "$FILE_COUNT" \
+		--io-log "$IO_LOG" \
+		--rename-log "$RENAME_LOG" \
+		--reset-log "$RESET_LOG" \
+		--status-final "$STATUS_FINAL" \
+		--slo-seconds "$SLO_SECONDS" \
+		--report-json "$REPORT_JSON" \
+		$( [[ "$reset_enabled" -eq 1 ]] && echo "--expect-reset" ); then
+		final_rc=1
+	fi
+
+	if [[ "$final_rc" -eq 0 ]]; then
+		log_summary "PASS: stress run completed successfully"
+	else
+		log_summary "FAIL: stress run detected one or more failures"
+	fi
+
+	log_summary "Artifacts available in: $OUT_DIR"
+	exit "$final_rc"
+}
+
+main "$@"
diff --git a/tools/testing/selftests/filesystems/ceph/run_validation.sh b/tools/testing/selftests/filesystems/ceph/run_validation.sh
new file mode 100755
index 000000000000..5d521e4f9e9b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/run_validation.sh
@@ -0,0 +1,350 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset - single-command validation.
+# Runs all test stages in sequence with per-stage timeouts.
+# If any stage hangs (filesystem stuck, process blocked), the
+# timeout kills it and reports failure.
+#
+# Usage:
+#   sudo ./run_validation.sh --mount-point /mnt/mycephfs
+#
+# Expected output on success:
+#
+#   === CephFS Client Reset Validation ===
+#   [stage 1/5] baseline         PASS (60s, no resets)
+#   [stage 2/5] corner_cases     PASS (4/4 passed)
+#   [stage 3/5] moderate         PASS (120s, resets every 5-15s)
+#   [stage 4/5] aggressive       PASS (120s, resets every 1-5s)
+#   [stage 5/5] status_check     PASS (phase=idle, last_errno=0)
+#
+#   RESULT: 5/5 stages passed
+#   Artifacts: /tmp/ceph_reset_validation_
+
+set -uo pipefail
+
+KSFT_SKIP=4
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+	MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "SKIP: No CephFS mount found and --mount-point not specified"
+		exit "$KSFT_SKIP"
+	fi
+	exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=""
+CLIENT_ID=""
+declare -a CLIENT_ARGS=()
+declare -a DEBUGFS_ARGS=()
+RUN_ID="$(date +%Y%m%d-%H%M%S)"
+OUT_DIR="/tmp/ceph_reset_validation_${RUN_ID}"
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+
+# Timeout margins: stage runtime + cooldown + validation + safety buffer
+STAGE1_TIMEOUT=120  # 60s run + 20s cooldown + 40s buffer
+STAGE2_TIMEOUT=300  # 4 corner cases, 30s each worst case + buffer
+STAGE3_TIMEOUT=240  # 120s run + 20s cooldown + 100s buffer
+STAGE4_TIMEOUT=240  # 120s run + 20s cooldown + 100s buffer
+STAGE5_TIMEOUT=10   # just reading debugfs
+
+PASS=0
+FAIL=0
+TOTAL=5
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point <path> [options]
+
+Required:
+  --mount-point PATH      CephFS mount point
+
+Options:
+  --out-dir PATH          Artifact directory (default: /tmp/ceph_reset_validation_)
+  --client-id ID          Ceph debugfs client id (optional)
+  --debugfs-root PATH     Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+  --help                  Show this message
+EOF
+}
+
+stage_result()
+{
+	local num="$1"
+	local name="$2"
+	local status="$3"
+	local detail="$4"
+
+	if [[ "$status" == "PASS" ]]; then
+		PASS=$((PASS + 1))
+	else
+		FAIL=$((FAIL + 1))
+	fi
+	printf '[stage %d/%d] %-16s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$detail"
+}
+
+# Run a command with a timeout. Returns 0 on success, 1 on failure/timeout.
+# Sets RUN_TIMED_OUT=1 if killed by timeout.
+#
+# The stage command runs in its own session/process group (via setsid).
+# On timeout the entire process group is killed, not just the top-level
+# script PID. This is required because stage scripts (reset_stress.sh,
+# reset_corner_cases.sh) spawn child processes - I/O workers, rename
+# workers, reset injectors, samplers - that would otherwise survive the
+# timeout and bleed into later stages, invalidating results.
+RUN_TIMED_OUT=0
+
+run_with_timeout()
+{
+	local timeout_sec="$1"
+	local logfile="$2"
+	shift 2
+
+	RUN_TIMED_OUT=0
+
+	# Start the stage in its own session via setsid so all descendant
+	# processes share a process group that we can kill atomically.
+	# In a non-interactive script, background children are not process
+	# group leaders, so setsid(1) calls setsid(2) directly (no extra
+	# fork) and the PID we capture IS the group leader.
+	setsid "$@" > "$logfile" 2>&1 &
+	local pid=$!
+
+	# Watchdog: on timeout, kill the entire process group
+	(
+		sleep "$timeout_sec"
+		if kill -0 "$pid" 2>/dev/null; then
+			echo "TIMEOUT: stage exceeded ${timeout_sec}s, killing process group $pid" >> "$logfile"
+			kill -TERM -- -"$pid" 2>/dev/null
+			sleep 2
+			kill -KILL -- -"$pid" 2>/dev/null
+		fi
+	) &
+	local watchdog_pid=$!
+
+	# Wait for the stage command
+	wait "$pid" 2>/dev/null
+	local rc=$?
+
+	# Kill the watchdog if it's still running
+	kill "$watchdog_pid" 2>/dev/null
+	wait "$watchdog_pid" 2>/dev/null
+
+	# Check if it was killed by timeout
+	if grep -q "^TIMEOUT:" "$logfile" 2>/dev/null; then
+		RUN_TIMED_OUT=1
+		return 1
+	fi
+
+	return "$rc"
+}
+
+find_status_path()
+{
+	local entry
+
+	if [[ -n "$CLIENT_ID" ]]; then
+		if [[ -r "$DEBUGFS_ROOT/$CLIENT_ID/reset/status" ]]; then
+			echo "$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+			return 0
+		fi
+		return 1
+	fi
+
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		if [[ -r "${entry}reset/status" ]]; then
+			echo "${entry}reset/status"
+			return 0
+		fi
+	done
+	return 1
+}
+
+read_status_field()
+{
+	local status_path="$1"
+	local field="$2"
+	awk -F': ' -v key="$field" '$1 == key {print $2}' "$status_path" 2>/dev/null
+}
+
+# --- Parse arguments -------------------------------------------------------
+
+while [[ $# -gt 0 ]]; do
+	case "$1" in
+	--mount-point) MOUNT_POINT="$2"; shift 2 ;;
+	--out-dir) OUT_DIR="$2"; shift 2 ;;
+	--client-id) CLIENT_ID="$2"; shift 2 ;;
+	--debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+	--help|-h) usage; exit 0 ;;
+	*) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+	esac
+done
+
+if [[ -z "$MOUNT_POINT" ]]; then
+	echo "SKIP: --mount-point is required" >&2
+	usage
+	exit "$KSFT_SKIP"
+fi
+
+if [[ ! -d "$MOUNT_POINT" ]]; then
+	echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+	exit "$KSFT_SKIP"
+fi
+
+# Auto-detect client id when not specified, so all stages (including
+# stage 5 status check) use the same client consistently.
+if [[ -z "$CLIENT_ID" ]]; then
+	candidates=()
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		name="$(basename "$entry")"
+		if [[ -r "${entry}reset/status" ]]; then
+			candidates+=("$name")
+		fi
+	done
+	if [[ ${#candidates[@]} -eq 1 ]]; then
+		CLIENT_ID="${candidates[0]}"
+	elif [[ ${#candidates[@]} -gt 1 ]]; then
+		echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-id." >&2
+		exit "$KSFT_SKIP"
+	fi
+fi
+
+if [[ -n "$CLIENT_ID" ]]; then
+	CLIENT_ARGS=(--client-id "$CLIENT_ID")
+fi
+DEBUGFS_ARGS=(--debugfs-root "$DEBUGFS_ROOT")
+
+# Quick sanity: can we write to the mount?
+if ! touch "$MOUNT_POINT/.validation_probe_$$" 2>/dev/null; then
+	echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+	exit "$KSFT_SKIP"
+fi
+rm -f "$MOUNT_POINT/.validation_probe_$$"
+
+mkdir -p "$OUT_DIR"
+
+echo ""
+echo "=== CephFS Client Reset Validation ==="
+echo ""
+
+# --- Stage 1: Baseline (no resets) -----------------------------------------
+
+stage1_out="$OUT_DIR/stage1_baseline"
+if run_with_timeout "$STAGE1_TIMEOUT" "$stage1_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+		--mount-point "$MOUNT_POINT" \
+		--profile baseline \
+		--no-reset \
+		--duration-sec 60 \
+		"${CLIENT_ARGS[@]}" \
+		"${DEBUGFS_ARGS[@]}" \
+		--out-dir "$stage1_out"; then
+	stage_result 1 "baseline" "PASS" "60s, no resets"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 1 "baseline" "FAIL" "HUNG: killed after ${STAGE1_TIMEOUT}s"
+else
+	stage_result 1 "baseline" "FAIL" "see $stage1_out.log"
+fi
+
+# --- Stage 2: Corner cases -------------------------------------------------
+
+stage2_out="$OUT_DIR/stage2_corner_cases"
+mkdir -p "$stage2_out"
+if run_with_timeout "$STAGE2_TIMEOUT" "$stage2_out/output.log" \
+	"$SCRIPT_DIR/reset_corner_cases.sh" \
+		"${CLIENT_ARGS[@]}" \
+		"${DEBUGFS_ARGS[@]}" \
+		--mount-point "$MOUNT_POINT"; then
+	pass_line=$(grep -Eo '[0-9]+ passed, [0-9]+ failed, [0-9]+ skipped' "$stage2_out/output.log" | tail -1)
+	stage_result 2 "corner_cases" "PASS" "${pass_line:-all tests passed}"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 2 "corner_cases" "FAIL" "HUNG: killed after ${STAGE2_TIMEOUT}s"
+else
+	fail_line=$(grep -c 'FAIL' "$stage2_out/output.log" 2>/dev/null || echo "?")
+	stage_result 2 "corner_cases" "FAIL" "${fail_line} failures, see $stage2_out/output.log"
+fi
+
+# --- Stage 3: Moderate resets -----------------------------------------------
+
+stage3_out="$OUT_DIR/stage3_moderate"
+if run_with_timeout "$STAGE3_TIMEOUT" "$stage3_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+		--mount-point "$MOUNT_POINT" \
+		--profile moderate \
+		--duration-sec 120 \
+		"${CLIENT_ARGS[@]}" \
+		"${DEBUGFS_ARGS[@]}" \
+		--out-dir "$stage3_out"; then
+	stage_result 3 "moderate" "PASS" "120s, resets every 5-15s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 3 "moderate" "FAIL" "HUNG: killed after ${STAGE3_TIMEOUT}s"
+else
+	stage_result 3 "moderate" "FAIL" "see $stage3_out.log"
+fi
+
+# --- Stage 4: Aggressive resets ---------------------------------------------
+
+stage4_out="$OUT_DIR/stage4_aggressive"
+if run_with_timeout "$STAGE4_TIMEOUT" "$stage4_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+		--mount-point "$MOUNT_POINT" \
+		--profile aggressive \
+		--duration-sec 120 \
+		"${CLIENT_ARGS[@]}" \
+		"${DEBUGFS_ARGS[@]}" \
+		--out-dir "$stage4_out"; then
+	stage_result 4 "aggressive" "PASS" "120s, resets every 1-5s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 4 "aggressive" "FAIL" "HUNG: killed after ${STAGE4_TIMEOUT}s"
+else
+	stage_result 4 "aggressive" "FAIL" "see $stage4_out.log"
+fi
+
+# --- Stage 5: Post-run status check ----------------------------------------
+
+status_path=""
+if status_path=$(find_status_path); then
+	phase=$(read_status_field "$status_path" "phase")
+	last_errno=$(read_status_field "$status_path" "last_errno")
+	failure_count=$(read_status_field "$status_path" "failure_count")
+	drain_timed_out=$(read_status_field "$status_path" "drain_timed_out")
+	sessions_reset=$(read_status_field "$status_path" "sessions_reset")
+	blocked=$(read_status_field "$status_path" "blocked_requests")
+
+	# Save full status
+	cat "$status_path" > "$OUT_DIR/final_status.txt" 2>/dev/null
+
+	errors=""
+	[[ "$phase" != "idle" ]] && errors="${errors}phase=$phase "
+	[[ "$last_errno" != "0" ]] && errors="${errors}last_errno=$last_errno "
+	[[ "$failure_count" != "0" && -n "$failure_count" ]] && errors="${errors}failure_count=$failure_count "
+	[[ "$blocked" != "0" ]] && errors="${errors}blocked_requests=$blocked "
+
+	if [[ -z "$errors" ]]; then
+		detail="phase=$phase, last_errno=$last_errno, failure_count=${failure_count:-0}"
+		[[ "$drain_timed_out" == "yes" ]] && detail="$detail, drain_timed_out=yes"
+		[[ -n "$sessions_reset" ]] && detail="$detail, sessions_reset=$sessions_reset"
+		stage_result 5 "status_check" "PASS" "$detail"
+	else
+		stage_result 5 "status_check" "FAIL" "$errors"
+	fi
+else
+	stage_result 5 "status_check" "FAIL" "cannot read reset/status"
+fi
+
+# --- Summary ----------------------------------------------------------------
+
+echo ""
+if [[ "$FAIL" -eq 0 ]]; then
+	echo "RESULT: $PASS/$TOTAL stages passed"
+else
+	echo "RESULT: $PASS/$TOTAL stages passed, $FAIL FAILED"
+fi
+echo "Artifacts: $OUT_DIR"
+echo ""
+
+exit "$FAIL"
diff --git a/tools/testing/selftests/filesystems/ceph/settings b/tools/testing/selftests/filesystems/ceph/settings
new file mode 100644
index 000000000000..79b65bdf05db
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/settings
@@ -0,0 +1 @@
+timeout=1200
diff --git a/tools/testing/selftests/filesystems/ceph/validate_consistency.py b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
new file mode 100755
index 000000000000..c230a59bdb3a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import bisect
+import hashlib
+import json
+import os
+from pathlib import Path
+
+
+def sha256_file(path: Path) -> str:
+    digest = hashlib.sha256()
+    with path.open("rb") as handle:
+        while True:
+            chunk = handle.read(1 << 20)
+            if not chunk:
+                break
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def parse_io_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) != 5:
+                raise ValueError(f"io log line {line_no}: expected 5 columns, got {len(parts)}")
+            ts_ms, seq, logical_id, relpath, digest = parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "relpath": relpath,
+                    "digest": digest,
+                }
+            )
+    return records
+
+
+def parse_rename_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) == 6:
+                ts_ms, seq, logical_id, src_rel, dst_rel, rc = parts
+            elif len(parts) == 7:
+                ts_ms, worker_id, seq, logical_id, src_rel, dst_rel, rc = parts
+                _ = worker_id  # worker id is informational only
+            else:
+                raise ValueError(
+                    f"rename log line {line_no}: expected 6 or 7 columns, got {len(parts)}"
+                )
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "src_rel": src_rel,
+                    "dst_rel": dst_rel,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_reset_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) != 4:
+                raise ValueError(f"reset log line {line_no}: expected 4 columns, got {len(parts)}")
+            ts_ms, seq, reason, rc = parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "reason": reason,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_status_file(path: Path):
+    status = {}
+    if not path.exists():
+        return status
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            line = line.strip()
+            if not line or ":" not in line:
+                continue
+            key, value = line.split(":", 1)
+            status[key.strip()] = value.strip()
+    return status
+
+
+def to_int(value: str, default: int = 0):
+    try:
+        return int(value)
+    except Exception:
+        return default
+
+
+def validate_namespace(data_dir: Path, file_count: int, issues):
+    actual_locations = {}
+    actual_paths = {}
+    for logical_id in range(file_count):
+        name = f"file_{logical_id:05d}"
+        found = []
+        for subdir in ("A", "B"):
+            candidate = data_dir / subdir / name
+            if candidate.exists():
+                found.append((subdir, candidate))
+        if len(found) != 1:
+            issues.append(
+                f"namespace invariant failed for logical_id={logical_id:05d}: expected exactly one file in A/B, found {len(found)}"
+            )
+            continue
+        actual_locations[logical_id] = found[0][0]
+        actual_paths[logical_id] = found[0][1]
+    return actual_locations, actual_paths
+
+
+def validate_rename_invariant(rename_records, actual_locations, issues):
+    expected_locations = {}
+    for rec in rename_records:
+        if rec["rc"] != 0:
+            continue
+        dst = rec["dst_rel"]
+        if "/" not in dst:
+            continue
+        expected_locations[rec["logical_id"]] = dst.split("/", 1)[0]
+
+    for logical_id, expected in expected_locations.items():
+        actual = actual_locations.get(logical_id)
+        if actual is None:
+            continue
+        if actual != expected:
+            issues.append(
+                f"rename invariant failed for logical_id={logical_id:05d}: expected location={expected}, actual={actual}"
+            )
+
+
+def validate_data_invariant(io_records, actual_paths, issues):
+    expected_hash = {}
+    for rec in io_records:
+        digest = rec["digest"]
+        if not digest:
+            continue
+        expected_hash[rec["logical_id"]] = digest
+
+    for logical_id, digest in expected_hash.items():
+        path = actual_paths.get(logical_id)
+        if path is None:
+            continue
+        actual_digest = sha256_file(path)
+        if digest != actual_digest:
+            issues.append(
+                f"data invariant failed for logical_id={logical_id:05d}: expected digest={digest}, actual digest={actual_digest}"
+            )
+
+
+def validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues):
+    if not args.expect_reset:
+        return
+
+    successful_reset_times = [rec["ts_ms"] for rec in reset_records if rec["rc"] == 0]
+    if not successful_reset_times:
+        issues.append("expected reset activity but no successful reset trigger was observed")
+
+    phase = status.get("phase")
+    blocked_requests = to_int(status.get("blocked_requests", "0"), default=-1)
+    last_errno = to_int(status.get("last_errno", "0"), default=1)
+    failure_count = to_int(status.get("failure_count", "0"), default=-1)
+
+    if phase is None:
+        issues.append("missing final reset status file or phase field")
+    elif phase.lower() != "idle":
+        issues.append(f"recovery invariant failed: phase={phase}, expected idle")
+
+    if blocked_requests != 0:
+        issues.append(f"recovery invariant failed: blocked_requests={blocked_requests}, expected 0")
+    if last_errno != 0:
+        issues.append(f"recovery invariant failed: last_errno={last_errno}, expected 0")
+    if failure_count > 0:
+        issues.append(
+            f"recovery invariant failed: failure_count={failure_count}, "
+            "one or more resets failed during the run"
+        )
+
+    op_times = [rec["ts_ms"] for rec in io_records]
+    op_times.extend(rec["ts_ms"] for rec in rename_records if rec["rc"] == 0)
+    op_times.sort()
+
+    if successful_reset_times and not op_times:
+        issues.append("recovery SLO failed: no workload completion events were recorded")
+        return
+
+    slo_ms = args.slo_seconds * 1000
+    for ts in successful_reset_times:
+        idx = bisect.bisect_left(op_times, ts)
+        if idx >= len(op_times):
+            issues.append(f"recovery SLO failed: no operation completion observed after reset at ts_ms={ts}")
+            continue
+        delta = op_times[idx] - ts
+        if delta > slo_ms:
+            issues.append(
+                f"recovery SLO failed: first post-reset completion at {delta}ms exceeds threshold {slo_ms}ms (reset ts_ms={ts})"
+            )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Validate Ceph reset stress artifacts")
+    parser.add_argument("--data-dir", required=True)
+    parser.add_argument("--file-count", required=True, type=int)
+    parser.add_argument("--io-log", required=True)
+    parser.add_argument("--rename-log", required=True)
+    parser.add_argument("--reset-log", required=True)
+    parser.add_argument("--status-final", required=False, default="")
+    parser.add_argument("--slo-seconds", required=False, type=int, default=30)
+    parser.add_argument("--expect-reset", action="store_true")
+    parser.add_argument("--report-json", required=False, default="")
+    args = parser.parse_args()
+
+    data_dir = Path(args.data_dir)
+    io_log = Path(args.io_log)
+    rename_log = Path(args.rename_log)
+    reset_log = Path(args.reset_log)
+    status_final = Path(args.status_final) if args.status_final else Path("__missing_status__")
+
+    issues = []
+
+    if not data_dir.exists():
+        issues.append(f"data directory is missing: {data_dir}")
+
+    try:
+        io_records = parse_io_log(io_log)
+        rename_records = parse_rename_log(rename_log)
+        reset_records = parse_reset_log(reset_log)
+    except Exception as exc:
+        issues.append(f"log parsing failed: {exc}")
+        io_records = []
+        rename_records = []
+        reset_records = []
+
+    status = parse_status_file(status_final)
+
+    actual_locations, actual_paths = validate_namespace(data_dir, args.file_count, issues)
+    validate_rename_invariant(rename_records, actual_locations, issues)
+    validate_data_invariant(io_records, actual_paths, issues)
+    validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues)
+
+    report = {
+        "file_count": args.file_count,
+        "io_records": len(io_records),
+        "rename_records": len(rename_records),
+        "reset_records":
len(reset_records), + "expect_reset": args.expect_reset, + "issues": issues, + } + + if args.report_json: + report_path =3D Path(args.report_json) + report_path.write_text(json.dumps(report, indent=3D2, sort_keys=3D= True), encoding=3D"utf-8") + + if issues: + print("FAIL: consistency validation found issues") + for issue in issues: + print(f" - {issue}") + raise SystemExit(1) + + print("PASS: consistency validation succeeded") + + +if __name__ =3D=3D "__main__": + main() --=20 2.34.1