From nobody Tue Jun 16 19:34:27 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops
Date: Wed, 29 Apr 2026 12:51:56 +0000
Message-Id: <20260429125206.1512203-2-amarkuze@redhat.com>
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>

Define named bit-position constants for all CEPH_I_* inode flags and
derive the bitmask values from them. This gives every flag a named _BIT
constant usable with the test_bit/set_bit/clear_bit family. The
intentionally unused bit position 1 is documented inline.

Convert all flag modifications to use atomic bitops (set_bit, clear_bit,
test_and_clear_bit). The previous code mixed lockless atomic ops on some
flags (ERROR_WRITE, ODIRECT) with non-atomic read-modify-write
(|= / &= ~) on other flags sharing the same unsigned long. A concurrent
non-atomic RMW can clobber an adjacent lockless atomic update -- for
example, a lockless clear_bit(ERROR_WRITE) could be silently resurrected
by a concurrent ci->i_ceph_flags |= CEPH_I_FLUSH under the spinlock.
Using atomic bitops for all modifications eliminates this class of race
entirely.

Flags whose only users are now the _BIT form (ERROR_WRITE,
ASYNC_CHECK_CAPS) have their old mask defines removed to document that
callers must use the _BIT constant with the set_bit/test_bit family.
ERROR_FILELOCK and SHUTDOWN retain their mask defines because they are
still used via bitmask tests in lockless readers
(ceph_inode_is_shutdown, reconnect_caps_cb).

Flag reads under i_ceph_lock continue to use bitmask tests where the
tested flag is only modified under the same lock; this is safe because
the lock serialises both the read and the write. Reads of the remaining
flags likewise keep their non-atomic bitmask tests under i_ceph_lock,
which is correct and unchanged.

The lockless reader ceph_inode_is_shutdown() retains the READ_ONCE()
snapshot plus bitmask test pattern -- the single atomic load into a
local variable is correct and avoids a second memory access that
test_bit() would require. It now uses the named CEPH_I_SHUTDOWN mask
constant instead of an inline BIT().

The direct assignment in ceph_finish_async_create() is converted from
i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This inode is I_NEW at
this point -- still invisible to other threads and guaranteed to have
zero flags from alloc_inode -- so either form is safe, but set_bit()
keeps the conversion uniform. The only remaining direct assignment
(alloc_inode zeroing) operates on an inode that is not yet visible to
other threads, so it is safe without atomic ops.

The dead precomputation of the flags variable in ceph_pool_perm_check()
is removed; flags is instead re-read from i_ceph_flags under the lock,
after the set_bit() calls and before jumping back to check:, keeping a
single source of truth.

Co-developed-by: Viacheslav Dubeyko
Signed-off-by: Viacheslav Dubeyko
Signed-off-by: Alex Markuze
---
 fs/ceph/addr.c       | 17 ++++++------
 fs/ceph/caps.c       | 24 ++++++++---------
 fs/ceph/file.c       | 13 ++++-----
 fs/ceph/inode.c      |  4 +--
 fs/ceph/locks.c      | 22 ++++-----------
 fs/ceph/mds_client.c |  3 ++-
 fs/ceph/mds_client.h |  2 +-
 fs/ceph/snap.c       |  2 +-
 fs/ceph/super.h      | 64 +++++++++++++++++++++++---------------------
 fs/ceph/xattr.c      |  2 +-
 10 files changed, 72 insertions(+), 81 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2090fc78529c..35c5fdb5a448 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -2583,20 +2583,19 @@ int ceph_pool_perm_check(struct inode *inode, int need)
 	if (ret < 0)
 		return ret;
 
-	flags = CEPH_I_POOL_PERM;
-	if (ret & POOL_READ)
-		flags |= CEPH_I_POOL_RD;
-	if (ret & POOL_WRITE)
-		flags |= CEPH_I_POOL_WR;
-
 	spin_lock(&ci->i_ceph_lock);
 	if (pool == ci->i_layout.pool_id &&
 	    pool_ns == rcu_dereference_raw(ci->i_layout.pool_ns)) {
-		ci->i_ceph_flags |= flags;
-	} else {
+		set_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_READ)
+			set_bit(CEPH_I_POOL_RD_BIT, &ci->i_ceph_flags);
+		if (ret & POOL_WRITE)
+			set_bit(CEPH_I_POOL_WR_BIT, &ci->i_ceph_flags);
+	} else {
 		pool = ci->i_layout.pool_id;
-		flags = ci->i_ceph_flags;
 	}
+	/* Re-read flags under the lock so check: sees the updated bits. */
+	flags = ci->i_ceph_flags;
 	spin_unlock(&ci->i_ceph_lock);
 	goto check;
 }
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..cb9e78b713d9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -549,7 +549,7 @@ static void __cap_delay_requeue_front(struct ceph_mds_client *mdsc,
 
 	doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
 	spin_lock(&mdsc->cap_delay_lock);
-	ci->i_ceph_flags |= CEPH_I_FLUSH;
+	set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 	if (!list_empty(&ci->i_cap_delay_list))
 		list_del_init(&ci->i_cap_delay_list);
 	list_add(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
@@ -1409,7 +1409,7 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
 	      ceph_cap_string(revoking));
 	BUG_ON((retain & CEPH_CAP_PIN) == 0);
 
-	ci->i_ceph_flags &= ~CEPH_I_FLUSH;
+	clear_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 
 	cap->issued &= retain;  /* drop bits we don't want */
 	/*
@@ -1666,7 +1666,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
 		last_tid = capsnap->cap_flush.tid;
 	}
 
-	ci->i_ceph_flags &= ~CEPH_I_FLUSH_SNAPS;
+	clear_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 
 	while (first_tid <= last_tid) {
 		struct ceph_cap *cap = ci->i_auth_cap;
@@ -2026,7 +2026,7 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags)
 
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		ci->i_ceph_flags |= CEPH_I_ASYNC_CHECK_CAPS;
+		set_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT, &ci->i_ceph_flags);
 
 		/* Don't send messages until we get async create reply */
 		spin_unlock(&ci->i_ceph_lock);
@@ -2577,7 +2577,7 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
 		return;
 
-	ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH;
+	clear_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
 
 	list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) {
 		if (cf->is_capsnap) {
@@ -2686,7 +2686,7 @@ void ceph_early_kick_flushing_caps(struct ceph_mds_client *mdsc,
 			__kick_flushing_caps(mdsc, session, ci,
 					     oldest_flush_tid);
 		} else {
-			ci->i_ceph_flags |= CEPH_I_KICK_FLUSH;
+			set_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
 		}
 
 		spin_unlock(&ci->i_ceph_lock);
@@ -2829,7 +2829,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 	spin_lock(&ci->i_ceph_lock);
 
 	if ((flags & CHECK_FILELOCK) &&
-	    (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK)) {
+	    test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		doutc(cl, "%p %llx.%llx error filelock\n", inode,
 		      ceph_vinop(inode));
 		ret = -EIO;
@@ -3207,7 +3207,7 @@ static int ceph_try_drop_cap_snap(struct ceph_inode_info *ci,
 		BUG_ON(capsnap->cap_flush.tid > 0);
 		ceph_put_snap_context(capsnap->context);
 		if (!list_is_last(&capsnap->ci_item, &ci->i_cap_snaps))
-			ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+			set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 
 		list_del(&capsnap->ci_item);
 		ceph_put_cap_snap(capsnap);
@@ -3396,7 +3396,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
 			if (ceph_try_drop_cap_snap(ci, capsnap)) {
 				put++;
 			} else {
-				ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+				set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 				flush_snaps = true;
 			}
 		}
@@ -3648,7 +3648,7 @@ static void handle_cap_grant(struct inode *inode,
 
 		if (ci->i_layout.pool_id != old_pool ||
 		    extra_info->pool_ns != old_ns)
-			ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
 
 		extra_info->pool_ns = old_ns;
 
@@ -4815,7 +4815,7 @@ int ceph_drop_caps_for_unlink(struct inode *inode)
 			doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode,
 			      ceph_vinop(inode));
 			spin_lock(&mdsc->cap_delay_lock);
-			ci->i_ceph_flags |= CEPH_I_FLUSH;
+			set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
 			if (!list_empty(&ci->i_cap_delay_list))
 				list_del_init(&ci->i_cap_delay_list);
 			list_add_tail(&ci->i_cap_delay_list,
@@ -5080,7 +5080,7 @@ int ceph_purge_inode_cap(struct inode *inode, struct ceph_cap *cap, bool *invali
 
 		if (atomic_read(&ci->i_filelock_ref) > 0) {
 			/* make further file lock syscall return -EIO */
-			ci->i_ceph_flags |= CEPH_I_ERROR_FILELOCK;
+			set_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 			pr_warn_ratelimited_client(cl,
 				" dropping file locks for %p %llx.%llx\n",
 				inode, ceph_vinop(inode));
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5e7c73a29aa3..e2622f1cfbff 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -579,12 +579,12 @@ static void wake_async_create_waiters(struct inode *inode,
 
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
-		clear_and_wake_up_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
+		/* Serialized by i_ceph_lock; the two ops touch different bits. */
+		clear_and_wake_up_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
 
-		if (ci->i_ceph_flags & CEPH_I_ASYNC_CHECK_CAPS) {
-			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CHECK_CAPS;
+		if (test_and_clear_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT,
+				       &ci->i_ceph_flags))
 			check_cap = true;
-		}
 	}
 	ceph_kick_flushing_inode_caps(session, ci);
 	spin_unlock(&ci->i_ceph_lock);
@@ -747,7 +747,8 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
 			 * that point and don't worry about setting
 			 * CEPH_I_ASYNC_CREATE.
 			 */
-			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+			set_bit(CEPH_I_ASYNC_CREATE_BIT,
+				&ceph_inode(inode)->i_ceph_flags);
 			unlock_new_inode(inode);
 		}
 		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
@@ -2422,7 +2423,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
 	    (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
-	    (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+	    test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
 		struct ceph_snap_context *snapc;
 		struct iov_iter data;
 
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index d99e12d1100b..f75d66760d54 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1142,7 +1142,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		rcu_assign_pointer(ci->i_layout.pool_ns, pool_ns);
 
 		if (ci->i_layout.pool_id != old_pool || pool_ns != old_ns)
-			ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+			clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
 
 		pool_ns = old_ns;
 
@@ -3199,7 +3199,7 @@ void ceph_inode_shutdown(struct inode *inode)
 	bool invalidate = false;
 
 	spin_lock(&ci->i_ceph_lock);
-	ci->i_ceph_flags |= CEPH_I_SHUTDOWN;
+	set_bit(CEPH_I_SHUTDOWN_BIT, &ci->i_ceph_flags);
 	p = rb_first(&ci->i_caps);
 	while (p) {
 		struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index dd764f9c64b9..c4ff2266bb94 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -57,9 +57,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
 	ci = ceph_inode(inode);
 	if (atomic_dec_and_test(&ci->i_filelock_ref)) {
 		/* clear error when all locks are released */
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
-		spin_unlock(&ci->i_ceph_lock);
+		clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
 	}
 	fl->fl_u.ceph.inode = NULL;
 	iput(inode);
@@ -271,15 +269,10 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
 	else if (IS_SETLKW(cmd))
 		wait = 1;
 
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err = -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
 			posix_lock_file(file, fl, NULL);
-		return err;
+		return -EIO;
 	}
 
 	if (lock_is_read(fl))
@@ -331,15 +324,10 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
 
 	doutc(cl, "fl_file: %p\n", fl->c.flc_file);
 
-	spin_lock(&ci->i_ceph_lock);
-	if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
-		err = -EIO;
-	}
-	spin_unlock(&ci->i_ceph_lock);
-	if (err < 0) {
+	if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
 		if (lock_is_unlock(fl))
 			locks_lock_file_wait(file, fl);
-		return err;
+		return -EIO;
 	}
 
 	if (IS_SETLKW(cmd))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..ccf0d53dde2b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3613,7 +3613,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
 
 		spin_lock(&ci->i_ceph_lock);
 		cap = ci->i_auth_cap;
-		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
+		if (test_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
+		    mds != cap->mds) {
 			doutc(cl, "session changed for auth cap %d -> %d\n",
 			      cap->session->s_mds, session->s_mds);
 
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0428a5eaf28c..e91a199d56fd 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -658,7 +658,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 
-	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_I_ASYNC_CREATE_BIT,
 			   TASK_KILLABLE);
 }
 
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 52b4c2684f92..9b79a5eaca93 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -700,7 +700,7 @@ int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
 		return 0;
 	}
 
-	ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+	set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
 	doutc(cl, "%p %llx.%llx cap_snap %p snapc %p %llu %s s=%llu\n",
 	      inode, ceph_vinop(inode), capsnap, capsnap->context,
 	      capsnap->context->seq, ceph_cap_string(capsnap->dirty),
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 29a980e22dc2..66b047606d65 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -655,23 +655,34 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 /*
  * Ceph inode.
  */
-#define CEPH_I_DIR_ORDERED	(1 << 0)  /* dentries in dir are ordered */
-#define CEPH_I_FLUSH	(1 << 2)  /* do not delay flush of dirty metadata */
-#define CEPH_I_POOL_PERM	(1 << 3)  /* pool rd/wr bits are valid */
-#define CEPH_I_POOL_RD	(1 << 4)  /* can read from pool */
-#define CEPH_I_POOL_WR	(1 << 5)  /* can write to pool */
-#define CEPH_I_SEC_INITED	(1 << 6)  /* security initialized */
-#define CEPH_I_KICK_FLUSH	(1 << 7)  /* kick flushing caps */
-#define CEPH_I_FLUSH_SNAPS	(1 << 8)  /* need flush snapss */
-#define CEPH_I_ERROR_WRITE	(1 << 9)  /* have seen write errors */
-#define CEPH_I_ERROR_FILELOCK	(1 << 10) /* have seen file lock errors */
-#define CEPH_I_ODIRECT_BIT	(11) /* inode in direct I/O mode */
-#define CEPH_I_ODIRECT	(1 << CEPH_I_ODIRECT_BIT)
-#define CEPH_ASYNC_CREATE_BIT	(12) /* async create in flight for this */
-#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
-#define CEPH_I_SHUTDOWN	(1 << 13) /* inode is no longer usable */
-#define CEPH_I_ASYNC_CHECK_CAPS	(1 << 14) /* check caps immediately after async
-					     creating finishes */
+#define CEPH_I_DIR_ORDERED_BIT	(0)  /* dentries in dir are ordered */
+					/* bit 1 historically unused */
+#define CEPH_I_FLUSH_BIT	(2)  /* do not delay flush of dirty metadata */
+#define CEPH_I_POOL_PERM_BIT	(3)  /* pool rd/wr bits are valid */
+#define CEPH_I_POOL_RD_BIT	(4)  /* can read from pool */
+#define CEPH_I_POOL_WR_BIT	(5)  /* can write to pool */
+#define CEPH_I_SEC_INITED_BIT	(6)  /* security initialized */
+#define CEPH_I_KICK_FLUSH_BIT	(7)  /* kick flushing caps */
+#define CEPH_I_FLUSH_SNAPS_BIT	(8)  /* need flush snaps */
+#define CEPH_I_ERROR_WRITE_BIT	(9)  /* have seen write errors */
+#define CEPH_I_ERROR_FILELOCK_BIT	(10) /* have seen file lock errors */
+#define CEPH_I_ODIRECT_BIT	(11) /* inode in direct I/O mode */
+#define CEPH_I_ASYNC_CREATE_BIT	(12) /* async create in flight for this */
+#define CEPH_I_SHUTDOWN_BIT	(13) /* inode is no longer usable */
+#define CEPH_I_ASYNC_CHECK_CAPS_BIT	(14) /* check caps after async creating finishes */
+
+#define CEPH_I_DIR_ORDERED	(1 << CEPH_I_DIR_ORDERED_BIT)
+#define CEPH_I_FLUSH	(1 << CEPH_I_FLUSH_BIT)
+#define CEPH_I_POOL_PERM	(1 << CEPH_I_POOL_PERM_BIT)
+#define CEPH_I_POOL_RD	(1 << CEPH_I_POOL_RD_BIT)
+#define CEPH_I_POOL_WR	(1 << CEPH_I_POOL_WR_BIT)
+#define CEPH_I_SEC_INITED	(1 << CEPH_I_SEC_INITED_BIT)
+#define CEPH_I_KICK_FLUSH	(1 << CEPH_I_KICK_FLUSH_BIT)
+#define CEPH_I_FLUSH_SNAPS	(1 << CEPH_I_FLUSH_SNAPS_BIT)
+#define CEPH_I_ODIRECT	(1 << CEPH_I_ODIRECT_BIT)
+#define CEPH_I_ASYNC_CREATE	(1 << CEPH_I_ASYNC_CREATE_BIT)
+#define CEPH_I_ERROR_FILELOCK	(1 << CEPH_I_ERROR_FILELOCK_BIT)
+#define CEPH_I_SHUTDOWN	(1 << CEPH_I_SHUTDOWN_BIT)
 
 /*
  * Masks of ceph inode work.
@@ -684,27 +695,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 
 /*
  * We set the ERROR_WRITE bit when we start seeing write errors on an inode
- * and then clear it when they start succeeding. Note that we do a lockless
- * check first, and only take the lock if it looks like it needs to be changed.
- * The write submission code just takes this as a hint, so we're not too
- * worried if a few slip through in either direction.
+ * and then clear it when they start succeeding. The write submission code
+ * just takes this as a hint, so we're not too worried if a few slip through
+ * in either direction.
  */
 static inline void ceph_set_error_write(struct ceph_inode_info *ci)
 {
-	if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
 
 static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
 {
-	if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
-		spin_lock(&ci->i_ceph_lock);
-		ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
-		spin_unlock(&ci->i_ceph_lock);
-	}
+	clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
 }
 
 static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 5f87f62091a1..7cf9e908c2fe 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1054,7 +1054,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value,
 	if (current->journal_info &&
 	    !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) &&
 	    security_ismaclabel(name + XATTR_SECURITY_PREFIX_LEN))
-		ci->i_ceph_flags |= CEPH_I_SEC_INITED;
+		set_bit(CEPH_I_SEC_INITED_BIT, &ci->i_ceph_flags);
 out:
 	spin_unlock(&ci->i_ceph_lock);
 	return err;
-- 
2.34.1

From nobody Tue Jun 16 19:34:27 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze, Viacheslav Dubeyko
Subject: [PATCH v3 02/11] ceph: use proper endian conversion for flock_len in reconnect
Date: Wed, 29 Apr 2026 12:51:57 +0000
Message-Id: <20260429125206.1512203-3-amarkuze@redhat.com>
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>

Replace the __force __le32 cast with cpu_to_le32() for the flock_len
field in reconnect_caps_cb(). The old code used a type-system bypass to
silence sparse; the new form uses the proper endian conversion macro.

Also switch from a raw bitmask test against i_ceph_flags to test_bit()
on the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor
for the unsigned long flags field after the bit-position conversion.
Remove the now-unused CEPH_I_ERROR_FILELOCK mask define since all
callers use the _BIT form with test_bit/set_bit/clear_bit.

Reviewed-by: Viacheslav Dubeyko
Signed-off-by: Alex Markuze
---
 fs/ceph/mds_client.c | 5 +++--
 fs/ceph/super.h      | 1 -
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ccf0d53dde2b..871f0eef468d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4693,8 +4693,9 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
 		rec.v2.issued = cpu_to_le32(cap->issued);
 		rec.v2.snaprealm = cpu_to_le64(ci->i_snap_realm->ino);
 		rec.v2.pathbase = cpu_to_le64(path_info.vino.ino);
-		rec.v2.flock_len = (__force __le32)
-			((ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) ? 0 : 1);
+		rec.v2.flock_len = cpu_to_le32(
+			test_bit(CEPH_I_ERROR_FILELOCK_BIT,
+				 &ci->i_ceph_flags) ? 0 : 1);
 	} else {
 		struct timespec64 ts;
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 66b047606d65..30911ccf961e 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -681,7 +681,6 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 #define CEPH_I_FLUSH_SNAPS	(1 << CEPH_I_FLUSH_SNAPS_BIT)
 #define CEPH_I_ODIRECT	(1 << CEPH_I_ODIRECT_BIT)
 #define CEPH_I_ASYNC_CREATE	(1 << CEPH_I_ASYNC_CREATE_BIT)
-#define CEPH_I_ERROR_FILELOCK	(1 << CEPH_I_ERROR_FILELOCK_BIT)
 #define CEPH_I_SHUTDOWN	(1 << CEPH_I_SHUTDOWN_BIT)
 
 /*
-- 
2.34.1

From nobody Tue Jun 16 19:34:27 2026
kMQ0X/JQ7eENQYarRNQxNDmN3k6NEvj73tr7BIbPz+IjMaonAEQ7NOO2b5tFgxgikefn TwSrt2nmW0n2/RqB3jRBQEIUb0edHLNvCpJhqi/Wkz7A6Vr2qWsPj+zmNqufVBW+VNGE vouZGjs4wr9EI4cZ36AYr6pFXZrjOlK/f0vUkyzArxjJlswJP1eXLDi2SPaIrpJUNRYY 33UA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777467143; x=1778071943; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=asL+rz7v2ijvPcwX3iOzgMmiTXf19M9ZSCghMhBcDBY=; b=FGP4HvteL9JAWS+KH5Vs0BHyjDurHHCLcp/EwRq7X4vPa/5eigkN66iMDnGXrt16kv 0NdQEaGqL4LpdnMeP8RBjyvuPBrq6HeVDR8iMlADk8bOnRg27A25C5FPVAhtqyd+pTHd 1dFnTlkGAPtuiPqo7d8/29A5t74nYkv5VEWoGB9dZ8o9qke0/UxuXy9lGTzKxAasrt3Y e1uo65j+v8yUi019ZnELEcXitViI0IZugb2/0Ngt4QbxYFg+6LuZzL6/vooKJtFKapIc EQV3IuEys2KRe8Awj+yP4rBN6jiEqGi1bF77zzT/kAJA5VvhCEZOh020YwJg1dlohZ+a ck1g== X-Gm-Message-State: AOJu0YxnN6zOQwVTS5WiEUgHr0yfsu4WSnjvCVoJcm+7tdH2U+LQ1aGg 4XFO2UvY48Pd/ZEGHfbPJDYV5YIsmlfjMTYWaXCMtyVLTwljctZ2JCg1b1mEya5fqP6O8QIdPYn EQ+fgscSp2Dk0BNOjiv1qnbRhHmxfbu/LiwSHt576K8HN4NqeX051ffbbPUXeDLV3Dg== X-Gm-Gg: AeBDievr47b2QoPuVLA8C+wPh9QgPoIZb9X5eA1PcQv8JOsBY1pLiMmpLQzjDCKN8OI fAyBW6iGkmj4UPX71vcblO8nTYKfcM3OGBz0kWOzqBf/GBHFMuajpyjEuVQPclkE7V+9IqgQ6GV Rwlko1luThqnMktl9ibtQcHB8qMOS8XFvZJ3WIAmFky/LJT6gWspZo/i9erFf07rmQaZ5SBZlKU uCBcatKZN+VJyay3sdZkIG/31PmZ2OoIKYAo1qUsZe3uTYnJFtOwczn1dVCMyXjMBqQghGR1iun 33iM6rU6YFLLvA4uwCb1BOX87zgyXnZrIYmu1TJfUEp1+lAX8XaxIut5QKdVkIradFIFCNBW6vn HK7AfJ3oQcE3rMb0J3Cmvhcr6jtOQmKrcULiQiJU4tXeCG5v88dOqfxnt8dLll/Jc4w== X-Received: by 2002:a05:6402:5057:b0:676:98a0:1c8d with SMTP id 4fb4d7f45d1cf-679bb04a803mr2891179a12.1.1777467140462; Wed, 29 Apr 2026 05:52:20 -0700 (PDT) X-Received: by 2002:a05:6402:5057:b0:676:98a0:1c8d with SMTP id 4fb4d7f45d1cf-679bb04a803mr2891167a12.1.1777467139847; Wed, 29 Apr 2026 05:52:19 -0700 (PDT) Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com. 
[13.121.85.79]) by smtp.gmail.com with ESMTPSA id 4fb4d7f45d1cf-67b22166a6esm680526a12.25.2026.04.29.05.52.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Apr 2026 05:52:19 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Date: Wed, 29 Apr 2026 12:51:58 +0000 Message-Id: <20260429125206.1512203-4-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com> References: <20260429125206.1512203-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Change send_mds_reconnect() to return an error code so callers can detect and report reconnect failures instead of silently ignoring them. Add early bailout checks for sessions that are already closed, rejected, or unregistered, which avoids sending reconnect messages for sessions that can no longer be recovered. The early -ESTALE and -ENOENT bailouts use a separate fail_return label that skips the pr_err_client diagnostic, since these codes indicate expected concurrent-teardown races rather than genuine reconnect build failures. Move the "reconnect start" log after the early-bailout checks so it only appears for sessions that actually proceed with reconnect. Save the prior session state before transitioning to RECONNECTING, and restore it in the failure path. Without this, a transient build or encoding failure (-ENOMEM, -ENOSPC) strands the session in RECONNECTING indefinitely because check_new_map() only retries sessions in RESTARTING state. Rewrite mds_peer_reset() to handle the case where the MDS is past its RECONNECT phase (i.e. active). 
An active MDS rejects CLIENT_RECONNECT messages because it only accepts them during its own RECONNECT window after restart. Previously, the client would send a doomed reconnect that the MDS would reject or ignore. Now, the client tears the session down locally and lets new requests re-open a fresh session, which is the correct recovery for this scenario. The RECONNECTING state is handled on the same teardown path, since an active MDS will reject the client's reconnect attempts regardless of the session's local state.

Add explicit cases for CLOSED and REJECTED session states in mds_peer_reset() since these are terminal states where a connection drop is expected behavior.

The session teardown path in mds_peer_reset() follows the established drop-and-reacquire locking pattern from check_new_map(): take mdsc->mutex for session unregistration, release it, then take s->s_mutex separately for cleanup. This avoids introducing a new simultaneous lock nesting pattern.

Log reconnect failures from check_new_map() and mds_peer_reset() at pr_warn level rather than pr_err, since return codes like -ESTALE (closed/rejected session) and -ENOENT (unregistered session) are expected during concurrent teardown. Log dropped messages for unregistered sessions via doutc() (dynamic debug) rather than pr_info, as post-reset message arrival is routine and does not warrant unconditional logging.

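The early-bailout and state-restore behavior described in this message can be sketched in isolation. This is a simplified userspace model under stated assumptions, not the code in the diff: the sketch_* names are hypothetical, locking is omitted entirely, and a build_err parameter stands in for any failure while building the reconnect message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced set of session states for illustration. */
enum sketch_state {
	SKETCH_RESTARTING,
	SKETCH_RECONNECTING,
	SKETCH_CLOSED,
	SKETCH_REJECTED,
};

struct sketch_session {
	enum sketch_state state;
	bool registered;	/* stands in for the sessions[] lookup */
};

/*
 * Model of the reconnect entry point: bail out early for sessions
 * that cannot be recovered, and undo the RECONNECTING transition
 * when building the message fails (build_err != 0), so that a later
 * retry keyed on the prior state is still possible.
 */
static int sketch_send_reconnect(struct sketch_session *s, int build_err)
{
	enum sketch_state old_state;

	if (s->state == SKETCH_CLOSED || s->state == SKETCH_REJECTED)
		return -ESTALE;		/* expected teardown race */
	if (!s->registered)
		return -ENOENT;		/* session already unregistered */

	old_state = s->state;
	s->state = SKETCH_RECONNECTING;

	if (build_err) {
		s->state = old_state;	/* do not strand in RECONNECTING */
		return build_err;
	}
	return 0;
}
```

A caller in this model would treat -ESTALE and -ENOENT as routine and log other non-zero codes as warnings, matching the logging levels chosen above.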
Signed-off-by: Alex Markuze --- fs/ceph/mds_client.c | 169 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 155 insertions(+), 14 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 871f0eef468d..b62abae72c4c 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *= session, break; =20 case CEPH_SESSION_REJECT: - WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING); - pr_info_client(cl, "mds%d rejected session\n", - session->s_mds); + WARN_ON(session->s_state !=3D CEPH_MDS_SESSION_OPENING && + session->s_state !=3D CEPH_MDS_SESSION_RECONNECTING); + if (session->s_state =3D=3D CEPH_MDS_SESSION_RECONNECTING) + pr_info_client(cl, "mds%d reconnect rejected\n", + session->s_mds); + else + pr_info_client(cl, "mds%d rejected session\n", + session->s_mds); session->s_state =3D CEPH_MDS_SESSION_REJECTED; cleanup_session_requests(mdsc, session); remove_session_caps(session); @@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, in= t mds, void *arg) cap->mseq =3D 0; /* and migrate_seq */ cap->cap_gen =3D atomic_read(&cap->session->s_cap_gen); =20 + /* + * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect. + * Instead, locks are submitted for best-effort MDS reclaim + * via the flock_len field below. If reclaim fails (e.g., + * another client grabbed a conflicting lock), future lock + * operations will fail and set the error flag at that point. + */ + /* These are lost when the session goes away */ if (S_ISDIR(inode->i_mode)) { if (cap->issued & CEPH_CAP_DIR_CREATE) { @@ -4892,20 +4905,19 @@ static int encode_snap_realms(struct ceph_mds_clien= t *mdsc, * * This is a relatively heavyweight operation, but it's rare. 
*/ -static void send_mds_reconnect(struct ceph_mds_client *mdsc, - struct ceph_mds_session *session) +static int send_mds_reconnect(struct ceph_mds_client *mdsc, + struct ceph_mds_session *session) { struct ceph_client *cl =3D mdsc->fsc->client; struct ceph_msg *reply; int mds =3D session->s_mds; int err =3D -ENOMEM; + int old_state; struct ceph_reconnect_state recon_state =3D { .session =3D session, }; LIST_HEAD(dispose); =20 - pr_info_client(cl, "mds%d reconnect start\n", mds); - recon_state.pagelist =3D ceph_pagelist_alloc(GFP_NOFS); if (!recon_state.pagelist) goto fail_nopagelist; @@ -4917,6 +4929,32 @@ static void send_mds_reconnect(struct ceph_mds_clien= t *mdsc, xa_destroy(&session->s_delegated_inos); =20 mutex_lock(&session->s_mutex); + if (session->s_state =3D=3D CEPH_MDS_SESSION_CLOSED || + session->s_state =3D=3D CEPH_MDS_SESSION_REJECTED) { + pr_info_client(cl, "mds%d skipping reconnect, session %s\n", + mds, + ceph_session_state_name(session->s_state)); + mutex_unlock(&session->s_mutex); + ceph_msg_put(reply); + err =3D -ESTALE; + goto fail_return; + } + + mutex_lock(&mdsc->mutex); + if (mds >=3D mdsc->max_sessions || mdsc->sessions[mds] !=3D session) { + mutex_unlock(&mdsc->mutex); + pr_info_client(cl, + "mds%d skipping reconnect, session unregistered\n", + mds); + mutex_unlock(&session->s_mutex); + ceph_msg_put(reply); + err =3D -ENOENT; + goto fail_return; + } + mutex_unlock(&mdsc->mutex); + + pr_info_client(cl, "mds%d reconnect start\n", mds); + old_state =3D session->s_state; session->s_state =3D CEPH_MDS_SESSION_RECONNECTING; session->s_seq =3D 0; =20 @@ -5046,18 +5084,34 @@ static void send_mds_reconnect(struct ceph_mds_clie= nt *mdsc, =20 up_read(&mdsc->snap_rwsem); ceph_pagelist_release(recon_state.pagelist); - return; + return 0; =20 fail: ceph_msg_put(reply); up_read(&mdsc->snap_rwsem); + /* + * Restore prior session state so map-driven reconnect logic + * (check_new_map) can retry. 
Without this, a transient build + * failure strands the session in RECONNECTING indefinitely. + */ + session->s_state =3D old_state; mutex_unlock(&session->s_mutex); fail_nomsg: ceph_pagelist_release(recon_state.pagelist); fail_nopagelist: pr_err_client(cl, "error %d preparing reconnect for mds%d\n", err, mds); - return; + return err; + +fail_return: + /* + * Early-exit path for expected concurrent-teardown races + * (-ESTALE for closed/rejected sessions, -ENOENT for + * unregistered sessions). Skip the pr_err_client diagnostic + * since these are not genuine reconnect build failures. + */ + ceph_pagelist_release(recon_state.pagelist); + return err; } =20 =20 @@ -5138,9 +5192,15 @@ static void check_new_map(struct ceph_mds_client *md= sc, */ if (s->s_state =3D=3D CEPH_MDS_SESSION_RESTARTING && newstate >=3D CEPH_MDS_STATE_RECONNECT) { + int rc; + mutex_unlock(&mdsc->mutex); clear_bit(i, targets); - send_mds_reconnect(mdsc, s); + rc =3D send_mds_reconnect(mdsc, s); + if (rc) + pr_warn_client(cl, + "mds%d reconnect failed: %d\n", + i, rc); mutex_lock(&mdsc->mutex); } =20 @@ -5204,7 +5264,11 @@ static void check_new_map(struct ceph_mds_client *md= sc, } doutc(cl, "send reconnect to export target mds.%d\n", i); mutex_unlock(&mdsc->mutex); - send_mds_reconnect(mdsc, s); + err =3D send_mds_reconnect(mdsc, s); + if (err) + pr_warn_client(cl, + "mds%d export target reconnect failed: %d\n", + i, err); ceph_put_mds_session(s); mutex_lock(&mdsc->mutex); } @@ -6284,12 +6348,87 @@ static void mds_peer_reset(struct ceph_connection *= con) { struct ceph_mds_session *s =3D con->private; struct ceph_mds_client *mdsc =3D s->s_mdsc; + int session_state; =20 pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n", s->s_mds); - if (READ_ONCE(mdsc->fsc->mount_state) !=3D CEPH_MOUNT_FENCE_IO && - ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >=3D CEPH_MDS_STATE_REC= ONNECT) - send_mds_reconnect(mdsc, s); + + if (READ_ONCE(mdsc->fsc->mount_state) =3D=3D CEPH_MOUNT_FENCE_IO || + 
ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONN= ECT) + return; + + if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) =3D=3D CEPH_MDS_STATE_R= ECONNECT) { + int rc =3D send_mds_reconnect(mdsc, s); + + if (rc) + pr_warn_client(mdsc->fsc->client, + "mds%d reconnect failed: %d\n", + s->s_mds, rc); + return; + } + + /* + * MDS is active (past RECONNECT). It will not accept a + * CLIENT_RECONNECT from us, so tear the session down locally + * and let new requests re-open a fresh session. + * + * Snapshot session state with READ_ONCE, then revalidate under + * mdsc->mutex before acting. The subsequent mdsc->mutex + * section rechecks s_state to catch concurrent transitions, so + * the lockless snapshot here is safe. s->s_mutex is taken + * separately for cleanup after unregistration, which avoids + * introducing a new s->s_mutex + mdsc->mutex nesting. + */ + session_state =3D READ_ONCE(s->s_state); + + switch (session_state) { + case CEPH_MDS_SESSION_RESTARTING: + case CEPH_MDS_SESSION_RECONNECTING: + case CEPH_MDS_SESSION_CLOSING: + case CEPH_MDS_SESSION_OPEN: + case CEPH_MDS_SESSION_HUNG: + case CEPH_MDS_SESSION_OPENING: + mutex_lock(&mdsc->mutex); + if (s->s_mds >=3D mdsc->max_sessions || + mdsc->sessions[s->s_mds] !=3D s || + s->s_state !=3D session_state) { + pr_info_client(mdsc->fsc->client, + "mds%d state changed to %s during peer reset\n", + s->s_mds, + ceph_session_state_name(s->s_state)); + mutex_unlock(&mdsc->mutex); + return; + } + + ceph_get_mds_session(s); + s->s_state =3D CEPH_MDS_SESSION_CLOSED; + __unregister_session(mdsc, s); + __wake_requests(mdsc, &s->s_waiting); + mutex_unlock(&mdsc->mutex); + + mutex_lock(&s->s_mutex); + cleanup_session_requests(mdsc, s); + remove_session_caps(s); + mutex_unlock(&s->s_mutex); + + wake_up_all(&mdsc->session_close_wq); + + mutex_lock(&mdsc->mutex); + kick_requests(mdsc, s->s_mds); + mutex_unlock(&mdsc->mutex); + + ceph_put_mds_session(s); + break; + case CEPH_MDS_SESSION_CLOSED: + case 
CEPH_MDS_SESSION_REJECTED: + break; + default: + pr_warn_client(mdsc->fsc->client, + "mds%d peer reset in unexpected state %s\n", + s->s_mds, + ceph_session_state_name(session_state)); + break; + } } =20 static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg) @@ -6301,6 +6440,8 @@ static void mds_dispatch(struct ceph_connection *con,= struct ceph_msg *msg) =20 mutex_lock(&mdsc->mutex); if (__verify_registered_session(mdsc, s) < 0) { + doutc(cl, "dropping tid %llu from unregistered session %d\n", + le64_to_cpu(msg->hdr.tid), s->s_mds); mutex_unlock(&mdsc->mutex); goto out; } --=20 2.34.1 From nobody Tue Jun 16 19:34:27 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 627353FD12A for ; Wed, 29 Apr 2026 12:52:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777467149; cv=none; b=VAaBiRsSYtXcfwPDgOcTYAWVGjPcmU3FWUMuJrJPCcHVgH2o1uGX40XrfH3klRtcbhhCwbpZ6NupEn+/B37sPbkQQX6MQ9g4s4BiynzA8wyHGrSimQi3VFnxRGxdt5WrsK1/B7BXG5fSCwiwuKtWeqad2diJZmzPxwkz4tTHFSI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777467149; c=relaxed/simple; bh=XaOHlWWecmtb7BYdDb8myZ0lZCrnWCXCsZx1AwzhgEM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=TcHZ8mp+3TN/p1la7vZ4qbAeyBlNoLXLxDYd6WFu3I4lHha1wltSI16e2bS0BbOE1W0ZU2JuSdad/Nr1KVi6W75pq9LrG9Z0qiV0kcdzl8Fd2SmGLpF49ki5m3zyaqToxZCgd7+KEesH2T2xs5JGcq9PVKPeDav/svDO9uY3GpI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=QSUCk/s0; dkim=pass (2048-bit key) 
header.d=redhat.com header.i=@redhat.com header.b=CQZ29sps; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="QSUCk/s0"; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b="CQZ29sps" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1777467146; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=u+WWAgsIXWVgVMWDVgyTjAWpE9gW2mZV06D7tfpFocs=; b=QSUCk/s0H9BLd+953iZ3QNJ/avF63kZhNVuhTxQqKUQkMs6VNOicmurO0Y1n4BLcqJX44x tO18bWtEb1jMdRxvFP9aUNqFWxJpI4oPenMoQokDfwsCSUVyuwP4WzID/bBtd+ViiS6RKe UL9xxO8W4/OUMgVpDOlGDh97S9tANho= Received: from mail-ed1-f69.google.com (mail-ed1-f69.google.com [209.85.208.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-593-zd3VJifNNT2sidjaEtFLrw-1; Wed, 29 Apr 2026 08:52:23 -0400 X-MC-Unique: zd3VJifNNT2sidjaEtFLrw-1 X-Mimecast-MFC-AGG-ID: zd3VJifNNT2sidjaEtFLrw_1777467142 Received: by mail-ed1-f69.google.com with SMTP id 4fb4d7f45d1cf-676fb54c0cdso6877971a12.2 for ; Wed, 29 Apr 2026 05:52:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=google; t=1777467142; x=1778071942; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=u+WWAgsIXWVgVMWDVgyTjAWpE9gW2mZV06D7tfpFocs=; b=CQZ29spsrJSpWHpUlgJxSR8DTk86YZcSihZ3qz5QnspoAudmeXAfON55FzyLJfmqqK n/pfeELfCpAyxDVoag7V3hWpFGPzQ8M5y/EqIqn2bmvAvgm+VN4CfGqM6RaSv1t6f5II 
PDiwE54GYjFCB9Xy3xiF7sW7fGQYL9hlB0vHq8saOtuZndRFrfKkO5Z3NKSREyYeLWQ6 1fvmXh5L+a0Hg8KCYjPcSUP/fNC2Oe+nnSIUjODUAK8o/tvF0WsJ4mrOp28mF2CTlisU kt8ncMVgvgPZHIIqeXnEnS+xMkIVWnpguwVrr/xrC4PBaaWD6hLhLWiihJ5h5FN5uLRk EAEQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777467142; x=1778071942; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=u+WWAgsIXWVgVMWDVgyTjAWpE9gW2mZV06D7tfpFocs=; b=BfCt4LYb3FRrnN5malYnFXqFn98c4ms3kLGtbC/UhXXLQgp4vs2BiEmrWxtGDX4e/k /1Tsvyp2flFa5J9ErS0oEugrIV0XM0PHkQVxAK44rtTgAI8033GR1KiDlerchwW/4Kei mr2qKFsmJ+UppKV6Hd6Q6B+G7kDo458T70iaosEmCvtmTXz6aJfNQ5InT3iaQUFgib8V v3x3/vXfdrkNNUU0qjaCXwL7MMQbFGTGyljF2qxmFPU5OcV0ksJ3QRnPN0Jyo/d6U1tZ EFCahXecrGQN9ltp+ConLYBxNPkDzRddk37bz+Pm0iywn3f2pTryqD0xZ9R2RXFnXIBm fNaA== X-Gm-Message-State: AOJu0YzyAoxpBGfUpZRPvKgIGi4KTLaLQ2OqWnY0q1H4zUj8qONfG7// /IWba+CUbTRc+ZWjpLjP/V5m4Dje0ijQvus4cQ0rgTHi0Q7SH3mMTNshV3ZixYUVGRpYFloKktq +hHMmVzqvnqpVaASFg4PUcJMSWwCHogLR487pC4NubWGGr2XSI3z07VfEg1J9h9xf6Q== X-Gm-Gg: AeBDieu+iXU6S6VhHHfGXn3f9MXn9fdYZTcfEHPzpOqC2Sx7rLXXH6kuv7QOfwqaQbj dRQ60QagT1P6Jg5yyMtTGTWpW/zyOgw0lPEkRw5BlqOn5GQzdPY3l573T6Jtl8yLBgZ5yz5jISs pN/ZzrHkhnc4DBc5XU2D4V1f0AbwpgZSM4EpgeUhrrTw7zih0w0kJCP85q6jeaOzPMfg0IMaVbk b2Wzl3pgYuXsXFxAfLgkKUxz2wCalqSj0v6F7JJOrksleh7kZsN5ysI1PNB8FbAPtnKieqNV5++ +JOWK1FtPV1y5SU0pQ6nabcTSb7EzQsemzZhBMnGsdkp/mzCOknUmSUT6q3cP/rAZpNeatiMvcG 6jzXsaK8EJYwlyZxwgB7DDsBrYSFz4swy8FjNL6m2Sde+BJuWzZRqK04wD1Z3nqUMCg== X-Received: by 2002:a05:6402:324b:b0:679:223c:d195 with SMTP id 4fb4d7f45d1cf-679bb087506mr3036746a12.14.1777467141686; Wed, 29 Apr 2026 05:52:21 -0700 (PDT) X-Received: by 2002:a05:6402:324b:b0:679:223c:d195 with SMTP id 4fb4d7f45d1cf-679bb087506mr3036730a12.14.1777467141038; Wed, 29 Apr 2026 05:52:21 -0700 (PDT) Received: from cluster.. (4f.55.790d.ip4.static.sl-reverse.com. 
[13.121.85.79]) by smtp.gmail.com with ESMTPSA id 4fb4d7f45d1cf-67b22166a6esm680526a12.25.2026.04.29.05.52.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Apr 2026 05:52:20 -0700 (PDT) From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Date: Wed, 29 Apr 2026 12:51:59 +0000 Message-Id: <20260429125206.1512203-5-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com> References: <20260429125206.1512203-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Convert wait_caps_flush() from a silent indefinite wait into a diagnostic wait loop that periodically dumps pending cap flush state. The underlying wait semantics remain intact: callers still wait until the requested cap flushes complete. The difference is that long stalls now produce actionable diagnostics instead of looking like a silent hang. CEPH_CAP_FLUSH_MAX_DUMP_COUNT bounds the diagnostics in two ways: it limits the number of entries emitted per diagnostic dump, and it limits the number of timed diagnostic dumps before the wait continues silently. When more entries exist than the per-dump limit, a truncation count is reported. When the dump iteration limit is reached, a final suppression message is emitted so the transition to silence is explicit. The diagnostic dump collects flush entry data under cap_dirty_lock into a bounded on-stack array, then prints after releasing the lock. This avoids holding the spinlock across printk calls. A null cf->ci on the global flush list indicates a bug since all cap_flush entries are initialized with a valid ci before being added. 
Signal this with WARN_ON_ONCE while still printing enough context for debugging. READ_ONCE is used for the i_last_cap_flush_ack field, which is read outside the inode lock domain. Flush tids are monotonically increasing and acks are processed in order under i_ceph_lock, so the latest ack tid is always the most recently written value. Add a ci pointer to struct ceph_cap_flush so that the diagnostic dump can identify which inode each pending flush belongs to. The new i_last_cap_flush_ack field tracks the latest acknowledged flush tid per inode for diagnostic correlation. This improves reset-drain observability and is also useful for existing sync and writeback troubleshooting paths. Signed-off-by: Alex Markuze --- fs/ceph/caps.c | 10 +++++ fs/ceph/inode.c | 1 + fs/ceph/mds_client.c | 97 ++++++++++++++++++++++++++++++++++++++++++-- fs/ceph/super.h | 6 +++ 4 files changed, 110 insertions(+), 4 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index cb9e78b713d9..4b37d9ffdf7f 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info= *ci, =20 spin_lock(&mdsc->cap_dirty_lock); capsnap->cap_flush.tid =3D ++mdsc->last_cap_flush_tid; + capsnap->cap_flush.ci =3D ci; list_add_tail(&capsnap->cap_flush.g_list, &mdsc->cap_flush_list); if (oldest_flush_tid =3D=3D 0) @@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void) return NULL; =20 cf->is_capsnap =3D false; + cf->ci =3D NULL; return cf; } =20 @@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode, doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode)); =20 swap(cf, ci->i_prealloc_cap_flush); + cf->ci =3D ci; cf->caps =3D flushing; cf->wake =3D wake; =20 @@ -3826,6 +3829,13 @@ static void handle_cap_flush_ack(struct inode *inode= , u64 flush_tid, bool wake_ci =3D false; bool wake_mdsc =3D false; =20 + /* + * Flush tids are monotonically increasing and acks arrive in + * order under i_ceph_lock, so this is 
always the latest tid. + * Diagnostic readers use READ_ONCE() without holding the lock. + */ + WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid); + list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) { /* Is this the one that was flushed? */ if (cf->tid =3D=3D flush_tid) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index f75d66760d54..de465c7e96e8 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -670,6 +670,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb) INIT_LIST_HEAD(&ci->i_cap_snaps); ci->i_head_snapc =3D NULL; ci->i_snap_caps =3D 0; + ci->i_last_cap_flush_ack =3D 0; =20 ci->i_last_rd =3D ci->i_last_wr =3D jiffies - 3600 * HZ; for (i =3D 0; i < CEPH_FILE_MODE_BITS; i++) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index b62abae72c4c..d83003acfb06 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -27,6 +27,8 @@ #include =20 #define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE) +#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60 +#define CEPH_CAP_FLUSH_MAX_DUMP_COUNT 5 =20 /* * A cluster of MDS (metadata server) daemons is responsible for @@ -2286,19 +2288,106 @@ static int check_caps_flush(struct ceph_mds_client= *mdsc, } =20 /* - * flush all dirty inode data to disk. + * Dump pending cap flushes for diagnostic purposes. * - * returns true if we've flushed through want_flush_tid + * cf->ci is safe to dereference here: cap_flush entries hold a + * reference on the inode (via the cap), and entries are removed from + * cap_flush_list under cap_dirty_lock before the cap (and thus the + * inode reference) is released. Holding cap_dirty_lock therefore + * guarantees the inode remains valid for the lifetime of the scan. 
+ */ +struct flush_dump_entry { + u64 ino; + u64 snap; + int caps; + u64 tid; + u64 last_ack; + bool wake; + bool is_capsnap; + bool ci_null; +}; + +static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid) +{ + struct ceph_client *cl =3D mdsc->fsc->client; + struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_COUNT]; + struct ceph_cap_flush *cf; + int n =3D 0, remaining =3D 0; + + spin_lock(&mdsc->cap_dirty_lock); + list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) { + if (cf->tid > want_tid) + break; + if (n < CEPH_CAP_FLUSH_MAX_DUMP_COUNT) { + struct flush_dump_entry *e =3D &entries[n++]; + + e->ci_null =3D WARN_ON_ONCE(!cf->ci); + if (!e->ci_null) { + e->ino =3D ceph_ino(&cf->ci->netfs.inode); + e->snap =3D ceph_snap(&cf->ci->netfs.inode); + e->last_ack =3D READ_ONCE(cf->ci->i_last_cap_flush_ack); + } + e->caps =3D cf->caps; + e->tid =3D cf->tid; + e->wake =3D cf->wake; + e->is_capsnap =3D cf->is_capsnap; + } else { + remaining++; + } + } + spin_unlock(&mdsc->cap_dirty_lock); + + pr_info_client(cl, "still waiting for cap flushes through %llu:\n", + want_tid); + for (int i =3D 0; i < n; i++) { + struct flush_dump_entry *e =3D &entries[i]; + + if (e->ci_null) + pr_info_client(cl, + " (null ci) %s tid=3D%llu wake=3D%d%s\n", + ceph_cap_string(e->caps), e->tid, + e->wake, + e->is_capsnap ? " is_capsnap" : ""); + else + pr_info_client(cl, + " %llx.%llx %s tid=3D%llu last_ack=3D%llu wake=3D%d%s\n", + e->ino, e->snap, + ceph_cap_string(e->caps), e->tid, + e->last_ack, e->wake, + e->is_capsnap ? " is_capsnap" : ""); + } + if (remaining) + pr_info_client(cl, " ... and %d more pending flushes\n", + remaining); +} + +/* + * Wait for all cap flushes through @want_flush_tid to complete. + * Periodically dumps pending cap flush state for diagnostics. 
*/ static void wait_caps_flush(struct ceph_mds_client *mdsc, u64 want_flush_tid) { struct ceph_client *cl =3D mdsc->fsc->client; + int i =3D 0; + long ret; =20 doutc(cl, "want %llu\n", want_flush_tid); =20 - wait_event(mdsc->cap_flushing_wq, - check_caps_flush(mdsc, want_flush_tid)); + do { + /* 60 * HZ fits in a long on all supported architectures. */ + ret =3D wait_event_timeout(mdsc->cap_flushing_wq, + check_caps_flush(mdsc, want_flush_tid), + CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ); + if (ret =3D=3D 0) { + if (i < CEPH_CAP_FLUSH_MAX_DUMP_COUNT) + dump_cap_flushes(mdsc, want_flush_tid); + else if (i =3D=3D CEPH_CAP_FLUSH_MAX_DUMP_COUNT) + pr_info_client(cl, + "still waiting for cap flushes; suppressing further dumps\n"); + i++; + } + } while (ret =3D=3D 0); =20 doutc(cl, "ok, flushed thru %llu\n", want_flush_tid); } diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 30911ccf961e..9aca42c89ea0 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -238,6 +238,7 @@ struct ceph_cap_flush { bool is_capsnap; /* true means capsnap */ struct list_head g_list; // global struct list_head i_list; // per inode + struct ceph_inode_info *ci; }; =20 /* @@ -443,6 +444,11 @@ struct ceph_inode_info { struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or dirty|flushing caps */ unsigned i_snap_caps; /* cap bits for snapped files */ + /* + * Written under i_ceph_lock, read via READ_ONCE() + * from diagnostic paths. 
+ */ + u64 i_last_cap_flush_ack; =20 unsigned long i_last_rd; unsigned long i_last_wr; --=20 2.34.1 From nobody Tue Jun 16 19:34:27 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1A2A3FD15E for ; Wed, 29 Apr 2026 12:52:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777467152; cv=none; b=G/rFCwnRCUw5lFpChtWmWAw1JHbACtxATDbEhI7WD8OcWsy9kKV5DQYWY7PirUZdvoyhrYDrI+mDpemxj5O6sWvkaTp+2IRb1KwZVpWWQ5G1z99j7De0tcpiNMf3D2O6GoAVl6FFhjG1UWOYgNkYpfWCvB5Dg7nhxkLE3lKzba8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777467152; c=relaxed/simple; bh=VSfRPwYNnPEUZikZoNm4zMsHgJWcrfVAtA2TiNYyoJg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=MdEsBPinjuwJB96ZS8n12dN+8YLgNus5KSp3Jrsdn05qvMB4nC0V3mMttYFC4MdioqIlLt9nO1B2dUNcAY17bFwDHWCExu/dPbNq3Ycax8zSygRJ1RnWoNz5HiU1hV5/+IuojeZxrMWn/i3eeM2XG3GVtjONDXiSV1SOXxC+w2E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Xyo2eEHt; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b=EoNUeUjm; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Xyo2eEHt"; dkim=pass (2048-bit key) header.d=redhat.com header.i=@redhat.com header.b="EoNUeUjm" 
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v3 05/11] ceph: add client reset state machine and session teardown
Date: Wed, 29 Apr 2026 12:52:00 +0000
Message-Id: <20260429125206.1512203-6-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the client-side reset state machine, request gating, and manual
session teardown implementation.

Manual reset is an operator-triggered escape hatch for client/MDS
stalemates in which caps, locks, or unsafe metadata state stop making
forward progress. The reset blocks new metadata work, attempts a bounded
best-effort drain of dirty client state while sessions are still alive,
and finally asks the MDS to close sessions before tearing local session
state down directly.

The reset state machine tracks four phases:

  IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE

QUIESCING is set synchronously by schedule_reset() before the workqueue
item is dispatched, so that new metadata requests and file-lock
acquisitions are gated immediately -- even before the work function
begins running. All non-IDLE phases block callers on blocked_wq,
preventing races with session teardown.

The drain phase flushes mdlog state, dirty caps, and pending cap
releases for a bounded interval.
State that still cannot make progress within that interval is discarded
during teardown, which is the point of the reset: break the stalemate
and allow fresh sessions to rebuild clean state.

The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then clean
up caps and requests under s->s_mutex. Reconnect is not attempted
because the MDS only accepts reconnects during its own RECONNECT phase
after restart, not from an active client.

Blocked callers are released when reset completes and observe the final
result via -EIO (reset failed) or 0 (success). Internal work-function
errors such as -ENOMEM are not propagated to unrelated callers like
open() or flock(); the detailed error remains in debugfs and
tracepoints.

The work function checks st->shutdown before each phase transition
(DRAINING, TEARDOWN) so that state written by a concurrent
ceph_mdsc_destroy() is not overwritten. If destroy already took
ownership, the work function releases session references and returns
without touching the state.

The timeout calculation for blocked-request waiters uses max_t() to
prevent jiffies underflow when the deadline has already passed.

The close-grace sleep before teardown is a best-effort nudge to let
queued REQUEST_CLOSE messages egress; it is not a correctness
requirement since the MDS still has session_autoclose as a fallback.

The destroy path marks reset as failed and wakes blocked waiters before
cancel_work_sync() so unmount does not stall.
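
For readers following the phase rules described above, here is a minimal
userspace sketch of the gating logic. It is an illustration under stated
assumptions, not the kernel code: the names reset_model, model_schedule(),
model_advance(), model_complete(), model_destroy(), model_must_block(),
model_waiter_result() and model_remaining() are hypothetical, locking and
wait queues are left out, and -ECANCELED stands in for the -ESHUTDOWN that
the destroy path records.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the four reset phases. */
enum reset_phase {
	RESET_IDLE,
	RESET_QUIESCING,
	RESET_DRAINING,
	RESET_TEARDOWN,
};

struct reset_model {
	enum reset_phase phase;
	bool shutdown;		/* set once the destroy path takes ownership */
	int last_errno;		/* detailed result of the most recent run */
};

/* Only an IDLE machine accepts a new reset; gating starts right here. */
static int model_schedule(struct reset_model *m)
{
	assert(m);
	if (m->phase != RESET_IDLE)
		return -EBUSY;
	m->phase = RESET_QUIESCING;
	m->last_errno = 0;
	return 0;
}

/* Move to the next phase unless destroy already owns the state. */
static bool model_advance(struct reset_model *m, enum reset_phase next)
{
	assert(m);
	if (m->shutdown)
		return false;	/* caller drops its references and bails */
	m->phase = next;
	return true;
}

/* Record the result and reopen the gate, unless destroy got there first. */
static void model_complete(struct reset_model *m, int err)
{
	assert(m);
	if (m->shutdown)
		return;
	m->last_errno = err;
	m->phase = RESET_IDLE;
}

/* Destroy marks an in-flight reset as failed and reopens the gate. */
static void model_destroy(struct reset_model *m)
{
	assert(m);
	m->shutdown = true;
	if (m->phase != RESET_IDLE) {
		m->phase = RESET_IDLE;
		m->last_errno = -ECANCELED;	/* stand-in for -ESHUTDOWN */
	}
}

/* Every non-IDLE phase blocks new requests and lock acquisitions. */
static bool model_must_block(const struct reset_model *m)
{
	return m->phase != RESET_IDLE;
}

/* Released waiters see only 0 or -EIO, never the detailed error. */
static int model_waiter_result(const struct reset_model *m)
{
	return m->last_errno ? -EIO : 0;
}

/* Clamp the remaining wait to at least 1 tick once the deadline passed. */
static long model_remaining(unsigned long deadline, unsigned long now)
{
	long delta = (long)(deadline - now);

	return delta > 1 ? delta : 1;
}
```

A caller would run model_schedule(), step through model_advance() for
DRAINING and TEARDOWN, and finish with model_complete(). model_remaining()
plays the role of the max_t(long, deadline - jiffies, 1) clamp mentioned
above.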
Signed-off-by: Alex Markuze --- fs/ceph/locks.c | 16 ++ fs/ceph/mds_client.c | 455 +++++++++++++++++++++++++++++++++++++++++++ fs/ceph/mds_client.h | 42 ++++ 3 files changed, 513 insertions(+) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index c4ff2266bb94..677221bd64e0 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_l= ock *fl) { struct inode *inode =3D file_inode(file); struct ceph_inode_info *ci =3D ceph_inode(inode); + struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb); struct ceph_client *cl =3D ceph_inode_to_client(inode); int err =3D 0; u16 op =3D CEPH_MDS_OP_SETFILELOCK; @@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_= lock *fl) return -EIO; } =20 + /* Wait for reset to complete before acquiring new locks */ + if (op =3D=3D CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) { + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) + return err; + } + if (lock_is_read(fl)) lock_cmd =3D CEPH_LOCK_SHARED; else if (lock_is_write(fl)) @@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_= lock *fl) { struct inode *inode =3D file_inode(file); struct ceph_inode_info *ci =3D ceph_inode(inode); + struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb); struct ceph_client *cl =3D ceph_inode_to_client(inode); int err =3D 0; u8 wait =3D 0; @@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file= _lock *fl) return -EIO; } =20 + /* Wait for reset to complete before acquiring new locks */ + if (!lock_is_unlock(fl)) { + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) + return err; + } + if (IS_SETLKW(cmd)) wait =3D 1; =20 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index d83003acfb06..777af51ec8d8 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -67,6 +68,7 @@ static void __wake_requests(struct 
ceph_mds_client *mdsc, struct list_head *head); static void ceph_cap_release_work(struct work_struct *work); static void ceph_cap_reclaim_work(struct work_struct *work); +static void ceph_mdsc_reset_workfn(struct work_struct *work); =20 static const struct ceph_connection_operations mds_con_ops; =20 @@ -3797,6 +3799,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client = *mdsc, struct inode *dir, struct ceph_client *cl =3D mdsc->fsc->client; int err =3D 0; =20 + /* + * If a reset is in progress, wait for it to complete. + * + * This is best-effort: a request can pass this check just + * before the phase leaves IDLE and proceed concurrently with + * reset. That is acceptable because (a) such requests will + * either complete normally or fail and be retried by the + * caller, and (b) adding lock serialization here would + * penalize every request for a rare manual operation. + */ + err =3D ceph_mdsc_wait_for_reset(mdsc); + if (err) { + doutc(cl, "wait_for_reset failed: %d\n", err); + return err; + } + /* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */ if (req->r_inode) ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN); @@ -5203,6 +5221,421 @@ static int send_mds_reconnect(struct ceph_mds_clien= t *mdsc, return err; } =20 +const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase) +{ + switch (phase) { + case CEPH_CLIENT_RESET_IDLE: return "idle"; + case CEPH_CLIENT_RESET_QUIESCING: return "quiescing"; + case CEPH_CLIENT_RESET_DRAINING: return "draining"; + case CEPH_CLIENT_RESET_TEARDOWN: return "teardown"; + default: return "unknown"; + } +} + +/** + * ceph_mdsc_wait_for_reset - wait for an active reset to complete + * @mdsc: MDS client + * + * Returns 0 if reset completed successfully or no reset was active. + * Returns -EIO if reset completed with an error. + * Returns -ETIMEDOUT if we timed out waiting. + * Returns -ERESTARTSYS if interrupted by signal. + * + * Internal work-function errors (e.g. 
-ENOMEM) are not propagated + * to callers; they are mapped to -EIO. The detailed error is + * available via debugfs status and tracepoints. + */ +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_client *cl =3D mdsc->fsc->client; + unsigned long deadline =3D jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC *= HZ; + int blocked_count; + long remaining; + long wait_ret; + int ret; + + if (READ_ONCE(st->phase) =3D=3D CEPH_CLIENT_RESET_IDLE) + return 0; + + blocked_count =3D atomic_inc_return(&st->blocked_requests); + doutc(cl, "request blocked during reset, %d total blocked\n", + blocked_count); + +retry: + remaining =3D max_t(long, deadline - jiffies, 1); + wait_ret =3D wait_event_interruptible_timeout(st->blocked_wq, + READ_ONCE(st->phase) =3D=3D + CEPH_CLIENT_RESET_IDLE, + remaining); + + if (wait_ret =3D=3D 0) { + atomic_dec(&st->blocked_requests); + pr_warn_client(cl, "timed out waiting for reset to complete\n"); + return -ETIMEDOUT; + } + if (wait_ret < 0) { + atomic_dec(&st->blocked_requests); + return (int)wait_ret; /* -ERESTARTSYS */ + } + + /* + * Verify phase is still IDLE under the lock. If another reset + * was scheduled between the wake-up and this check, loop back + * and wait for it to finish rather than returning a stale result. + */ + spin_lock(&st->lock); + if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) { + spin_unlock(&st->lock); + if (time_before(jiffies, deadline)) + goto retry; + atomic_dec(&st->blocked_requests); + return -ETIMEDOUT; + } + ret =3D st->last_errno; + spin_unlock(&st->lock); + + atomic_dec(&st->blocked_requests); + return ret ? -EIO : 0; +} + +static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + + spin_lock(&st->lock); + /* + * If destroy already marked us as shut down, it owns the + * final bookkeeping and waiter wakeup. 
Just bail so we + * don't overwrite its state. + */ + if (st->shutdown) { + spin_unlock(&st->lock); + return; + } + st->last_finish =3D jiffies; + st->last_errno =3D ret; + st->phase =3D CEPH_CLIENT_RESET_IDLE; + if (ret) + st->failure_count++; + else + st->success_count++; + spin_unlock(&st->lock); + + /* Wake up all requests that were blocked waiting for reset */ + wake_up_all(&st->blocked_wq); +} + +static void ceph_mdsc_reset_workfn(struct work_struct *work) +{ + struct ceph_mds_client *mdsc =3D + container_of(work, struct ceph_mds_client, reset_work); + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_client *cl =3D mdsc->fsc->client; + struct ceph_mds_session **sessions =3D NULL; + char reason[CEPH_CLIENT_RESET_REASON_LEN]; + int max_sessions, i, n =3D 0, torn_down =3D 0; + int ret =3D 0; + + spin_lock(&st->lock); + strscpy(reason, st->last_reason, sizeof(reason)); + spin_unlock(&st->lock); + + mutex_lock(&mdsc->mutex); + max_sessions =3D mdsc->max_sessions; + if (max_sessions <=3D 0) { + mutex_unlock(&mdsc->mutex); + goto out_complete; + } + + sessions =3D kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL); + if (!sessions) { + mutex_unlock(&mdsc->mutex); + ret =3D -ENOMEM; + pr_err_client(cl, + "manual session reset failed to allocate session array\n"); + ceph_mdsc_reset_complete(mdsc, ret); + return; + } + + for (i =3D 0; i < max_sessions; i++) { + struct ceph_mds_session *session =3D mdsc->sessions[i]; + + if (!session) + continue; + + /* + * Read session state without s_mutex to avoid nesting + * mdsc->mutex -> s_mutex, which would invert the + * s_mutex -> mdsc->mutex order used by + * cleanup_session_requests(). s_state is an int + * so loads are atomic; the teardown loop below + * handles races with concurrent state transitions. 
+ */ + switch (READ_ONCE(session->s_state)) { + case CEPH_MDS_SESSION_OPEN: + case CEPH_MDS_SESSION_HUNG: + case CEPH_MDS_SESSION_OPENING: + case CEPH_MDS_SESSION_RESTARTING: + case CEPH_MDS_SESSION_RECONNECTING: + case CEPH_MDS_SESSION_CLOSING: + sessions[n++] =3D ceph_get_mds_session(session); + break; + default: + pr_info_client(cl, + "mds%d in state %s, skipping reset\n", + session->s_mds, + ceph_session_state_name(session->s_state)); + break; + } + } + mutex_unlock(&mdsc->mutex); + + pr_info_client(cl, + "manual session reset executing (sessions=3D%d, reason=3D\"%s\")\= n", + n, reason); + + if (n =3D=3D 0) { + kfree(sessions); + goto out_complete; + } + + spin_lock(&st->lock); + if (st->shutdown) { + spin_unlock(&st->lock); + goto out_sessions; + } + st->phase =3D CEPH_CLIENT_RESET_DRAINING; + spin_unlock(&st->lock); + + /* + * Best-effort drain: flush dirty state while sessions are still + * alive. New requests are blocked while phase !=3D IDLE. + * The sessions are functional, so non-stuck state drains normally. + * Stuck state (the cause of the stalemate the operator is trying + * to break) will not drain -- that is expected, and we proceed to + * forced teardown after the timeout. + * + * Three things are kicked off: + * 1. MDS journal -- send_flush_mdlog asks each MDS to journal + * pending unsafe operations (creates, renames, setattrs). + * This is best-effort: we do not wait for individual unsafe + * requests to reach safe status. Non-stuck ops typically + * complete within the bounded wait window below; stuck ops + * will not, and are force-dropped during teardown. + * 2. Dirty caps -- ceph_flush_dirty_caps triggers cap flush on + * all sessions. Non-stuck caps flush in milliseconds. + * 3. Cap releases -- push pending cap release messages. + * + * The cap-flush wait below provides the bounded drain window + * during which all three categories can make progress. 
+ */ + for (i =3D 0; i < n; i++) + send_flush_mdlog(sessions[i]); + + ceph_flush_dirty_caps(mdsc); + ceph_flush_cap_releases(mdsc); + + spin_lock(&mdsc->cap_dirty_lock); + if (!list_empty(&mdsc->cap_flush_list)) { + struct ceph_cap_flush *cf =3D + list_last_entry(&mdsc->cap_flush_list, + struct ceph_cap_flush, g_list); + u64 want_flush =3D mdsc->last_cap_flush_tid; + long drain_ret; + + /* + * Setting wake on the last entry is sufficient: flush + * entries complete in order, so when this entry finishes + * all earlier ones are already done. + */ + cf->wake =3D true; + spin_unlock(&mdsc->cap_dirty_lock); + pr_info_client(cl, + "draining (want_flush=3D%llu, %d sessions)\n", + want_flush, n); + drain_ret =3D wait_event_timeout(mdsc->cap_flushing_wq, + check_caps_flush(mdsc, + want_flush), + CEPH_CLIENT_RESET_DRAIN_SEC * HZ); + if (drain_ret =3D=3D 0) { + pr_info_client(cl, + "drain timed out, proceeding with forced teardown\n"); + spin_lock(&st->lock); + st->drain_timed_out =3D true; + spin_unlock(&st->lock); + } else { + pr_info_client(cl, "drain completed successfully\n"); + spin_lock(&st->lock); + st->drain_timed_out =3D false; + spin_unlock(&st->lock); + } + } else { + spin_unlock(&mdsc->cap_dirty_lock); + spin_lock(&st->lock); + st->drain_timed_out =3D false; + spin_unlock(&st->lock); + } + + spin_lock(&st->lock); + if (st->shutdown) { + spin_unlock(&st->lock); + goto out_sessions; + } + st->phase =3D CEPH_CLIENT_RESET_TEARDOWN; + spin_unlock(&st->lock); + + /* + * Ask each MDS to close the session before we tear it down + * locally. Without this the MDS sees only a connection drop and + * waits for the client to reconnect (up to session_autoclose + * seconds) before evicting the session and releasing locks. + * + * Reuse the normal close machinery so the session state/sequence + * snapshot is serialized under s_mutex and a racing s_seq bump + * retransmits REQUEST_CLOSE while the session remains CLOSING. 
+ * We send all close requests first, then yield briefly to let the + * network stack transmit them before __unregister_session() + * closes the connections. + */ + for (i =3D 0; i < n; i++) { + int err; + + mutex_lock(&sessions[i]->s_mutex); + err =3D __close_session(mdsc, sessions[i]); + mutex_unlock(&sessions[i]->s_mutex); + if (err < 0) + pr_warn_client(cl, + "mds%d failed to queue close request before reset: %d\n", + sessions[i]->s_mds, err); + } + /* + * Best-effort grace period: yield briefly so the network stack + * can transmit the queued REQUEST_CLOSE messages before we tear + * down connections. Not a correctness requirement -- the MDS + * will still evict via session_autoclose if it never receives + * the close request. + */ + if (n > 0) + msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS); + + /* + * Tear down each session: close the connection, remove all + * caps, clean up requests, then kick pending requests so they + * re-open a fresh session on the next attempt. + * + * This is modeled on the check_new_map() forced-close path + * for stopped MDS ranks - a proven pattern for hard session + * teardown. We do NOT attempt send_mds_reconnect() because + * the MDS only accepts reconnects during its own RECONNECT + * phase (after MDS restart), not from an active client. + * + * Any state that did not drain (caps that didn't flush, unsafe + * requests that the MDS didn't journal) is force-dropped here. + * This is intentional: that state is stuck and is the reason + * the operator triggered the reset. 
+ */ + for (i =3D 0; i < n; i++) { + int mds =3D sessions[i]->s_mds; + + pr_info_client(cl, "mds%d resetting session\n", mds); + + mutex_lock(&mdsc->mutex); + if (mds >=3D mdsc->max_sessions || + mdsc->sessions[mds] !=3D sessions[i]) { + pr_info_client(cl, + "mds%d session already torn down, skipping\n", + mds); + mutex_unlock(&mdsc->mutex); + ceph_put_mds_session(sessions[i]); + continue; + } + sessions[i]->s_state =3D CEPH_MDS_SESSION_CLOSED; + __unregister_session(mdsc, sessions[i]); + __wake_requests(mdsc, &sessions[i]->s_waiting); + mutex_unlock(&mdsc->mutex); + + mutex_lock(&sessions[i]->s_mutex); + cleanup_session_requests(mdsc, sessions[i]); + remove_session_caps(sessions[i]); + mutex_unlock(&sessions[i]->s_mutex); + + wake_up_all(&mdsc->session_close_wq); + + ceph_put_mds_session(sessions[i]); + + mutex_lock(&mdsc->mutex); + kick_requests(mdsc, mds); + mutex_unlock(&mdsc->mutex); + + torn_down++; + pr_info_client(cl, "mds%d session reset complete\n", mds); + } + + kfree(sessions); + + spin_lock(&st->lock); + st->sessions_reset =3D torn_down; + spin_unlock(&st->lock); + +out_complete: + ceph_mdsc_reset_complete(mdsc, ret); + return; + +out_sessions: + for (i =3D 0; i < n; i++) + ceph_put_mds_session(sessions[i]); + kfree(sessions); +} + +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc, + const char *reason) +{ + struct ceph_client_reset_state *st =3D &mdsc->reset_state; + struct ceph_fs_client *fsc =3D mdsc->fsc; + const char *msg =3D (reason && reason[0]) ? 
reason : "manual"; + int mount_state; + + mount_state =3D READ_ONCE(fsc->mount_state); + if (mount_state !=3D CEPH_MOUNT_MOUNTED) { + pr_warn_client(fsc->client, + "reset rejected: mount_state=3D%d (not mounted)\n", + mount_state); + return -EINVAL; + } + + spin_lock(&st->lock); + if (st->phase !=3D CEPH_CLIENT_RESET_IDLE) { + spin_unlock(&st->lock); + return -EBUSY; + } + + st->phase =3D CEPH_CLIENT_RESET_QUIESCING; + st->last_start =3D jiffies; + st->last_errno =3D 0; + st->drain_timed_out =3D false; + st->sessions_reset =3D 0; + st->trigger_count++; + strscpy(st->last_reason, msg, sizeof(st->last_reason)); + spin_unlock(&st->lock); + + if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) { + spin_lock(&st->lock); + st->phase =3D CEPH_CLIENT_RESET_IDLE; + st->last_errno =3D -EALREADY; + st->last_finish =3D jiffies; + st->failure_count++; + spin_unlock(&st->lock); + wake_up_all(&st->blocked_wq); + return -EALREADY; + } + + pr_info_client(mdsc->fsc->client, + "manual session reset scheduled (reason=3D\"%s\")\n", + msg); + return 0; +} + =20 /* * compare old and new mdsmaps, kicking requests @@ -5742,6 +6175,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc) INIT_LIST_HEAD(&mdsc->dentry_leases); INIT_LIST_HEAD(&mdsc->dentry_dir_leases); =20 + spin_lock_init(&mdsc->reset_state.lock); + init_waitqueue_head(&mdsc->reset_state.blocked_wq); + atomic_set(&mdsc->reset_state.blocked_requests, 0); + INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn); + ceph_caps_init(mdsc); ceph_adjust_caps_max_min(mdsc, fsc->mount_options); =20 @@ -6267,6 +6705,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc) /* flush out any connection work with references to us */ ceph_msgr_flush(); =20 + /* + * Mark reset as failed and wake any blocked waiters before + * cancelling, so unmount doesn't stall on blocked_wq timeout + * if cancel_work_sync() prevents the work from running. 
+ */ + spin_lock(&mdsc->reset_state.lock); + mdsc->reset_state.shutdown =3D true; + if (mdsc->reset_state.phase !=3D CEPH_CLIENT_RESET_IDLE) { + mdsc->reset_state.phase =3D CEPH_CLIENT_RESET_IDLE; + mdsc->reset_state.last_errno =3D -ESHUTDOWN; + mdsc->reset_state.last_finish =3D jiffies; + mdsc->reset_state.failure_count++; + } + spin_unlock(&mdsc->reset_state.lock); + wake_up_all(&mdsc->reset_state.blocked_wq); + + cancel_work_sync(&mdsc->reset_work); ceph_mdsc_stop(mdsc); =20 ceph_metric_destroy(&mdsc->metric); diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h index e91a199d56fd..afc08b0abbd5 100644 --- a/fs/ceph/mds_client.h +++ b/fs/ceph/mds_client.h @@ -74,6 +74,42 @@ struct ceph_fs_client; struct ceph_cap; =20 #define MDS_AUTH_UID_ANY -1 +#define CEPH_CLIENT_RESET_REASON_LEN 64 +#define CEPH_CLIENT_RESET_DRAIN_SEC 5 +#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100 +#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120 + +enum ceph_client_reset_phase { + CEPH_CLIENT_RESET_IDLE =3D 0, + /* + * QUIESCING is set synchronously by schedule_reset() before the + * workqueue item is dispatched. It gates new requests (any + * phase !=3D IDLE blocks callers) during the window between + * scheduling and the work function's transition to DRAINING. 
+ */ + CEPH_CLIENT_RESET_QUIESCING, + CEPH_CLIENT_RESET_DRAINING, + CEPH_CLIENT_RESET_TEARDOWN, +}; + +struct ceph_client_reset_state { + spinlock_t lock; + u64 trigger_count; + u64 success_count; + u64 failure_count; + unsigned long last_start; + unsigned long last_finish; + int last_errno; + enum ceph_client_reset_phase phase; + bool drain_timed_out; + bool shutdown; + int sessions_reset; + char last_reason[CEPH_CLIENT_RESET_REASON_LEN]; + + /* Request blocking during reset */ + wait_queue_head_t blocked_wq; + atomic_t blocked_requests; +}; =20 struct ceph_mds_cap_match { s64 uid; /* default to MDS_AUTH_UID_ANY */ @@ -536,6 +572,8 @@ struct ceph_mds_client { struct list_head dentry_dir_leases; /* lru list */ =20 struct ceph_client_metric metric; + struct work_struct reset_work; + struct ceph_client_reset_state reset_state; =20 spinlock_t snapid_map_lock; struct rb_root snapid_map_tree; @@ -559,10 +597,14 @@ extern struct ceph_mds_session * __ceph_lookup_mds_session(struct ceph_mds_client *, int mds); =20 extern const char *ceph_session_state_name(int s); +extern const char *ceph_reset_phase_name(enum ceph_client_reset_phase phas= e); =20 extern struct ceph_mds_session * ceph_get_mds_session(struct ceph_mds_session *s); extern void ceph_put_mds_session(struct ceph_mds_session *s); +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc, + const char *reason); +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc); =20 extern int ceph_mdsc_init(struct ceph_fs_client *fsc); extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc); --=20 2.34.1 From nobody Tue Jun 16 19:34:27 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 08A743FE665 for ; Wed, 29 Apr 2026 12:52:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints
Date: Wed, 29 Apr 2026 12:52:01 +0000
Message-Id: <20260429125206.1512203-7-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add the debugfs and trace plumbing used to trigger and observe manual
client reset.
The reset interface exposes a trigger file for operator-initiated reset
and a status file for tracking the most recent run. The tracepoints
record scheduling, completion, and blocked caller behavior so reset
progress can be diagnosed from the client side.

debugfs layout under /sys/kernel/debug/ceph//reset/:

  trigger - write to initiate a manual reset
  status  - read to see the most recent reset result

The reset directory is cleaned up via debugfs_remove_recursive() on the
parent, so individual file dentries are not stored.

Tracepoints:

  ceph_client_reset_schedule  - reset queued
  ceph_client_reset_complete  - reset finished (success or failure)
  ceph_client_reset_blocked   - caller blocked waiting for reset
  ceph_client_reset_unblocked - caller unblocked after reset

All tracepoints use a null-safe access for monc.auth->global_id to
guard against early-init or late-teardown edge cases.

Signed-off-by: Alex Markuze
---
 fs/ceph/debugfs.c           | 102 ++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.c        |   8 +++
 fs/ceph/super.h             |   1 +
 include/trace/events/ceph.h |  67 +++++++++++++++++++++++
 4 files changed, 178 insertions(+)

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 7dc307790240..beee4cfe8b18 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -360,16 +361,107 @@ static int status_show(struct seq_file *s, void *p)
 	return 0;
 }
 
+static int reset_status_show(struct seq_file *s, void *p)
+{
+	struct ceph_fs_client *fsc = s->private;
+	struct ceph_mds_client *mdsc = fsc->mdsc;
+	struct ceph_client_reset_state *st;
+	u64 trigger = 0, success = 0, failure = 0;
+	unsigned long last_start = 0, last_finish = 0;
+	int last_errno = 0;
+	enum ceph_client_reset_phase phase = CEPH_CLIENT_RESET_IDLE;
+	bool drain_timed_out = false;
+	int sessions_reset = 0;
+	int blocked_requests = 0;
+	char reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+	if (!mdsc)
+		return 0;
+
+	st = &mdsc->reset_state;
+
+	spin_lock(&st->lock);
+	trigger = st->trigger_count;
+	success = st->success_count;
+	failure = st->failure_count;
+	last_start = st->last_start;
+	last_finish = st->last_finish;
+	last_errno = st->last_errno;
+	phase = st->phase;
+	drain_timed_out = st->drain_timed_out;
+	sessions_reset = st->sessions_reset;
+	strscpy(reason, st->last_reason, sizeof(reason));
+	spin_unlock(&st->lock);
+
+	blocked_requests = atomic_read(&st->blocked_requests);
+
+	seq_printf(s, "phase: %s\n", ceph_reset_phase_name(phase));
+	seq_printf(s, "trigger_count: %llu\n", trigger);
+	seq_printf(s, "success_count: %llu\n", success);
+	seq_printf(s, "failure_count: %llu\n", failure);
+	if (last_start)
+		seq_printf(s, "last_start_ms_ago: %u\n",
+			   jiffies_to_msecs(jiffies - last_start));
+	else
+		seq_puts(s, "last_start_ms_ago: (never)\n");
+	if (last_finish)
+		seq_printf(s, "last_finish_ms_ago: %u\n",
+			   jiffies_to_msecs(jiffies - last_finish));
+	else
+		seq_puts(s, "last_finish_ms_ago: (never)\n");
+	seq_printf(s, "last_errno: %d\n", last_errno);
+	seq_printf(s, "last_reason: %s\n",
+		   reason[0] ? reason : "(none)");
+	seq_printf(s, "drain_timed_out: %s\n",
+		   drain_timed_out ? "yes" : "no");
+	seq_printf(s, "sessions_reset: %d\n", sessions_reset);
+	seq_printf(s, "blocked_requests: %d\n", blocked_requests);
+
+	return 0;
+}
+
+static ssize_t reset_trigger_write(struct file *file, const char __user *buf,
+				   size_t len, loff_t *ppos)
+{
+	struct ceph_fs_client *fsc = file->private_data;
+	struct ceph_mds_client *mdsc = fsc->mdsc;
+	char reason[CEPH_CLIENT_RESET_REASON_LEN];
+	size_t copy;
+	int ret;
+
+	if (!mdsc)
+		return -ENODEV;
+
+	copy = min_t(size_t, len, sizeof(reason) - 1);
+	if (copy && copy_from_user(reason, buf, copy))
+		return -EFAULT;
+	reason[copy] = '\0';
+	strim(reason);
+
+	ret = ceph_mdsc_schedule_reset(mdsc, reason);
+	if (ret)
+		return ret;
+
+	return len;
+}
+
 DEFINE_SHOW_ATTRIBUTE(mdsmap);
 DEFINE_SHOW_ATTRIBUTE(mdsc);
 DEFINE_SHOW_ATTRIBUTE(caps);
 DEFINE_SHOW_ATTRIBUTE(mds_sessions);
 DEFINE_SHOW_ATTRIBUTE(status);
+DEFINE_SHOW_ATTRIBUTE(reset_status);
 DEFINE_SHOW_ATTRIBUTE(metrics_file);
 DEFINE_SHOW_ATTRIBUTE(metrics_latency);
 DEFINE_SHOW_ATTRIBUTE(metrics_size);
 DEFINE_SHOW_ATTRIBUTE(metrics_caps);
 
+static const struct file_operations ceph_reset_trigger_fops = {
+	.owner = THIS_MODULE,
+	.open = simple_open,
+	.write = reset_trigger_write,
+	.llseek = noop_llseek,
+};
 
 /*
  * debugfs
@@ -404,6 +496,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
 	debugfs_remove(fsc->debugfs_caps);
 	debugfs_remove(fsc->debugfs_status);
 	debugfs_remove(fsc->debugfs_mdsc);
+	debugfs_remove_recursive(fsc->debugfs_reset_dir);
 	debugfs_remove_recursive(fsc->debugfs_metrics_dir);
 	doutc(fsc->client, "done\n");
 }
@@ -451,6 +544,15 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
 						 fsc,
 						 &caps_fops);
 
+	fsc->debugfs_reset_dir = debugfs_create_dir("reset",
+						    fsc->client->debugfs_dir);
+	debugfs_create_file("trigger", 0200,
+			    fsc->debugfs_reset_dir, fsc,
+			    &ceph_reset_trigger_fops);
+	debugfs_create_file("status", 0400,
+			    fsc->debugfs_reset_dir, fsc,
+			    &reset_status_fops);
+
 	fsc->debugfs_status = debugfs_create_file("status",
 						  0400,
 						  fsc->client->debugfs_dir,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 777af51ec8d8..8339c2c72f9a 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5261,6 +5261,7 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 	blocked_count = atomic_inc_return(&st->blocked_requests);
 	doutc(cl, "request blocked during reset, %d total blocked\n",
 	      blocked_count);
+	trace_ceph_client_reset_blocked(mdsc, blocked_count);
 
 retry:
 	remaining = max_t(long, deadline - jiffies, 1);
@@ -5272,10 +5273,12 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 	if (wait_ret == 0) {
 		atomic_dec(&st->blocked_requests);
 		pr_warn_client(cl, "timed out waiting for reset to complete\n");
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	if (wait_ret < 0) {
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, (int)wait_ret);
 		return (int)wait_ret; /* -ERESTARTSYS */
 	}
 
@@ -5290,12 +5293,14 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
 		if (time_before(jiffies, deadline))
 			goto retry;
 		atomic_dec(&st->blocked_requests);
+		trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
 		return -ETIMEDOUT;
 	}
 	ret = st->last_errno;
 	spin_unlock(&st->lock);
 
 	atomic_dec(&st->blocked_requests);
+	trace_ceph_client_reset_unblocked(mdsc, ret);
 	return ret ? -EIO : 0;
 }
 
@@ -5324,6 +5329,8 @@ static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
 
 	/* Wake up all requests that were blocked waiting for reset */
 	wake_up_all(&st->blocked_wq);
+
+	trace_ceph_client_reset_complete(mdsc, ret);
 }
 
 static void ceph_mdsc_reset_workfn(struct work_struct *work)
@@ -5633,6 +5640,7 @@ int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
 	pr_info_client(mdsc->fsc->client,
 		       "manual session reset scheduled (reason=\"%s\")\n",
 		       msg);
+	trace_ceph_client_reset_schedule(mdsc, msg);
 	return 0;
 }
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 9aca42c89ea0..5bf976b6c4fe 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -179,6 +179,7 @@ struct ceph_fs_client {
 	struct dentry *debugfs_status;
 	struct dentry *debugfs_mds_sessions;
 	struct dentry *debugfs_metrics_dir;
+	struct dentry *debugfs_reset_dir;
 #endif
 
 #ifdef CONFIG_CEPH_FSCACHE
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 08cb0659fbfc..1b990632f62b 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -226,6 +226,73 @@ TRACE_EVENT(ceph_handle_caps,
 		  __entry->mseq)
 );
 
+/*
+ * Client reset tracepoints - identify the client by its monitor-
+ * assigned global_id so traces remain meaningful when kernel pointer
+ * hashing is enabled.
+ */
+TRACE_EVENT(ceph_client_reset_schedule,
+	TP_PROTO(const struct ceph_mds_client *mdsc, const char *reason),
+	TP_ARGS(mdsc, reason),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__string(reason, reason ? reason : "")
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__assign_str(reason);
+	),
+	TP_printk("client_id=%llu reason=%s",
+		  __entry->client_id, __get_str(reason))
+);
+
+TRACE_EVENT(ceph_client_reset_complete,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__entry->ret = ret;
+	),
+	TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
+TRACE_EVENT(ceph_client_reset_blocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int blocked_count),
+	TP_ARGS(mdsc, blocked_count),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, blocked_count)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__entry->blocked_count = blocked_count;
+	),
+	TP_printk("client_id=%llu blocked_count=%d", __entry->client_id,
+		  __entry->blocked_count)
+);
+
+TRACE_EVENT(ceph_client_reset_unblocked,
+	TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+	TP_ARGS(mdsc, ret),
+	TP_STRUCT__entry(
+		__field(u64, client_id)
+		__field(int, ret)
+	),
+	TP_fast_assign(
+		__entry->client_id = mdsc->fsc->client->monc.auth ?
+			mdsc->fsc->client->monc.auth->global_id : 0;
+		__entry->ret = ret;
+	),
+	TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
 #undef EM
 #undef E_
 #endif /* _TRACE_CEPH_H */
-- 
2.34.1

From nobody Tue Jun 16 19:34:27 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v3 07/11] selftests: ceph: add reset consistency checker
Date: Wed, 29 Apr 2026 12:52:02 +0000
Message-Id: <20260429125206.1512203-8-amarkuze@redhat.com>
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>

Add a Python post-run validator for the CephFS client reset stress
test. The script reads data files written by the stress runner and
checks that every file was either written completely or is missing,
with no partial or corrupted content.

This is a prerequisite for the stress test script which invokes it
after each run.
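
For readers skimming the series, the complete-or-missing rule described
above can be illustrated with a small standalone sketch. This is not
code from the validator in this patch: classify_file() and demo() are
hypothetical names, the file names only mimic a file_NNNNN pattern, and
the expected digest is assumed to come from a log kept by the workload.

```python
import hashlib
import tempfile
from pathlib import Path


def classify_file(path: Path, expected_digest: str) -> str:
    """Return "missing", "complete", or "corrupt" for one data file.

    Hypothetical helper for illustration; not part of the patch.
    """
    if not path.exists():
        return "missing"
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return "complete" if digest == expected_digest else "corrupt"


def demo() -> dict:
    """Create three illustrative files in a temp dir and classify them."""
    payload = b"hello cephfs"
    expected = hashlib.sha256(payload).hexdigest()
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "file_00000").write_bytes(payload)      # fully written
        (root / "file_00001").write_bytes(payload[:5])  # truncated write
        # file_00002 is intentionally never created
        return {
            name: classify_file(root / name, expected)
            for name in ("file_00000", "file_00001", "file_00002")
        }


if __name__ == "__main__":
    print(demo())
```

In this sketch only the "corrupt" outcome would violate the rule stated
above; how a missing file is treated is a policy choice that belongs to
the real checker.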

Signed-off-by: Alex Markuze
---
 .../filesystems/ceph/validate_consistency.py  | 297 ++++++++++++++++++
 1 file changed, 297 insertions(+)
 create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py

diff --git a/tools/testing/selftests/filesystems/ceph/validate_consistency.py b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
new file mode 100755
index 000000000000..c230a59bdb3a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import bisect
+import hashlib
+import json
+import os
+from pathlib import Path
+
+
+def sha256_file(path: Path) -> str:
+    digest = hashlib.sha256()
+    with path.open("rb") as handle:
+        while True:
+            chunk = handle.read(1 << 20)
+            if not chunk:
+                break
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def parse_io_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) != 5:
+                raise ValueError(f"io log line {line_no}: expected 5 columns, got {len(parts)}")
+            ts_ms, seq, logical_id, relpath, digest = parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "relpath": relpath,
+                    "digest": digest,
+                }
+            )
+    return records
+
+
+def parse_rename_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) == 6:
+                ts_ms, seq, logical_id, src_rel, dst_rel, rc = parts
+            elif len(parts) == 7:
+                ts_ms, worker_id, seq, logical_id, src_rel, dst_rel, rc = parts
+                _ = worker_id  # worker id is informational only
+            else:
+                raise ValueError(
+                    f"rename log line {line_no}: expected 6 or 7 columns, got {len(parts)}"
+                )
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "logical_id": int(logical_id),
+                    "src_rel": src_rel,
+                    "dst_rel": dst_rel,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_reset_log(path: Path):
+    records = []
+    if not path.exists():
+        return records
+    with path.open("r", encoding="utf-8") as handle:
+        for line_no, line in enumerate(handle, 1):
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(",")
+            if len(parts) != 4:
+                raise ValueError(f"reset log line {line_no}: expected 4 columns, got {len(parts)}")
+            ts_ms, seq, reason, rc = parts
+            records.append(
+                {
+                    "ts_ms": int(ts_ms),
+                    "seq": int(seq),
+                    "reason": reason,
+                    "rc": int(rc),
+                }
+            )
+    return records
+
+
+def parse_status_file(path: Path):
+    status = {}
+    if not path.exists():
+        return status
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            line = line.strip()
+            if not line or ":" not in line:
+                continue
+            key, value = line.split(":", 1)
+            status[key.strip()] = value.strip()
+    return status
+
+
+def to_int(value: str, default: int = 0):
+    try:
+        return int(value)
+    except Exception:
+        return default
+
+
+def validate_namespace(data_dir: Path, file_count: int, issues):
+    actual_locations = {}
+    actual_paths = {}
+    for logical_id in range(file_count):
+        name = f"file_{logical_id:05d}"
+        found = []
+        for subdir in ("A", "B"):
+            candidate = data_dir / subdir / name
+            if candidate.exists():
+                found.append((subdir, candidate))
+        if len(found) != 1:
+            issues.append(
+                f"namespace invariant failed for logical_id={logical_id:05d}: expected exactly one file in A/B, found {len(found)}"
+            )
+            continue
+        actual_locations[logical_id] = found[0][0]
+        actual_paths[logical_id] = found[0][1]
+    return actual_locations, actual_paths
+
+
+def validate_rename_invariant(rename_records, actual_locations, issues):
+    expected_locations = {}
+    for rec in rename_records:
+        if rec["rc"] != 0:
+            continue
+        dst = rec["dst_rel"]
+        if "/" not in dst:
+            continue
+        expected_locations[rec["logical_id"]] = dst.split("/", 1)[0]
+
+    for logical_id, expected in expected_locations.items():
+        actual = actual_locations.get(logical_id)
+        if actual is None:
+            continue
+        if actual != expected:
+            issues.append(
+                f"rename invariant failed for logical_id={logical_id:05d}: expected location={expected}, actual={actual}"
+            )
+
+
+def validate_data_invariant(io_records, actual_paths, issues):
+    expected_hash = {}
+    for rec in io_records:
+        digest = rec["digest"]
+        if not digest:
+            continue
+        expected_hash[rec["logical_id"]] = digest
+
+    for logical_id, digest in expected_hash.items():
+        path = actual_paths.get(logical_id)
+        if path is None:
+            continue
+        actual_digest = sha256_file(path)
+        if digest != actual_digest:
+            issues.append(
+                f"data invariant failed for logical_id={logical_id:05d}: expected digest={digest}, actual digest={actual_digest}"
+            )
+
+
+def validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues):
+    if not args.expect_reset:
+        return
+
+    successful_reset_times = [rec["ts_ms"] for rec in reset_records if rec["rc"] == 0]
+    if not successful_reset_times:
+        issues.append("expected reset activity but no successful reset trigger was observed")
+
+    phase = status.get("phase")
+    blocked_requests = to_int(status.get("blocked_requests", "0"), default=-1)
+    last_errno = to_int(status.get("last_errno", "0"), default=1)
+    failure_count = to_int(status.get("failure_count", "0"), default=-1)
+
+    if phase is None:
+        issues.append("missing final reset status file or phase field")
+    elif phase.lower() != "idle":
+        issues.append(f"recovery invariant failed: phase={phase}, expected idle")
+
+    if blocked_requests != 0:
+        issues.append(f"recovery invariant failed: blocked_requests={blocked_requests}, expected 0")
+    if last_errno != 0:
+        issues.append(f"recovery invariant failed: last_errno={last_errno}, expected 0")
+    if failure_count > 0:
+        issues.append(
+            f"recovery invariant failed: failure_count={failure_count}, "
+            "one or more resets failed during the run"
+        )
+
+    op_times = [rec["ts_ms"] for rec in io_records]
+    op_times.extend(rec["ts_ms"] for rec in rename_records if rec["rc"] == 0)
+    op_times.sort()
+
+    if successful_reset_times and not op_times:
+        issues.append("recovery SLO failed: no workload completion events were recorded")
+        return
+
+    slo_ms = args.slo_seconds * 1000
+    for ts in successful_reset_times:
+        idx = bisect.bisect_left(op_times, ts)
+        if idx >= len(op_times):
+            issues.append(f"recovery SLO failed: no operation completion observed after reset at ts_ms={ts}")
+            continue
+        delta = op_times[idx] - ts
+        if delta > slo_ms:
+            issues.append(
+                f"recovery SLO failed: first post-reset completion at {delta}ms exceeds threshold {slo_ms}ms (reset ts_ms={ts})"
+            )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Validate Ceph reset stress artifacts")
+    parser.add_argument("--data-dir", required=True)
+    parser.add_argument("--file-count", required=True, type=int)
+    parser.add_argument("--io-log", required=True)
+    parser.add_argument("--rename-log", required=True)
+    parser.add_argument("--reset-log", required=True)
+    parser.add_argument("--status-final", required=False, default="")
+    parser.add_argument("--slo-seconds", required=False, type=int, default=30)
+    parser.add_argument("--expect-reset", action="store_true")
+    parser.add_argument("--report-json", required=False, default="")
+    args = parser.parse_args()
+
+    data_dir = Path(args.data_dir)
+    io_log = Path(args.io_log)
+    rename_log = Path(args.rename_log)
+    reset_log = Path(args.reset_log)
+    status_final = Path(args.status_final) if args.status_final else Path("__missing_status__")
+
+    issues = []
+
+    if not data_dir.exists():
+        issues.append(f"data directory is missing: {data_dir}")
+
+    try:
+        io_records = parse_io_log(io_log)
+        rename_records = parse_rename_log(rename_log)
+        reset_records = parse_reset_log(reset_log)
+    except Exception as exc:
+        issues.append(f"log parsing failed: {exc}")
+        io_records = []
+        rename_records = []
+        reset_records = []
+
+    status = parse_status_file(status_final)
+
+    actual_locations, actual_paths = validate_namespace(data_dir, args.file_count, issues)
+    validate_rename_invariant(rename_records, actual_locations, issues)
+    validate_data_invariant(io_records, actual_paths, issues)
+    validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues)
+
+    report = {
+        "file_count": args.file_count,
+        "io_records": len(io_records),
+        "rename_records": len(rename_records),
+        "reset_records": len(reset_records),
+        "expect_reset": args.expect_reset,
+        "issues": issues,
+    }
+
+    if args.report_json:
+        report_path = Path(args.report_json)
+        report_path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
+
+    if issues:
+        print("FAIL: consistency validation found issues")
+        for issue in issues:
+            print(f" - {issue}")
+        raise SystemExit(1)
+
+    print("PASS: consistency validation succeeded")
+
+
+if __name__ == "__main__":
+    main()
-- 
2.34.1

From nobody Tue Jun 16 19:34:27 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v3 08/11] selftests: ceph: add reset stress test
Date: Wed, 29 Apr 2026 12:52:03 +0000
Message-Id: <20260429125206.1512203-9-amarkuze@redhat.com>
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>

Add a single-client stress test for the CephFS manual session reset
feature. The test runs concurrent I/O workers alongside periodic reset
injection, then validates data integrity via validate_consistency.py.
Supports four profiles (baseline, moderate, aggressive, soak) with configurable duration, reset interval, and worker counts. Signed-off-by: Alex Markuze --- .../filesystems/ceph/reset_stress.sh | 694 ++++++++++++++++++ 1 file changed, 694 insertions(+) create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh diff --git a/tools/testing/selftests/filesystems/ceph/reset_stress.sh b/too= ls/testing/selftests/filesystems/ceph/reset_stress.sh new file mode 100755 index 000000000000..c503c75a5f7a --- /dev/null +++ b/tools/testing/selftests/filesystems/ceph/reset_stress.sh @@ -0,0 +1,694 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# CephFS reset stress test: +# - Runs concurrent I/O and rename workloads +# - Triggers random client resets through debugfs +# - Validates consistency and recovery behavior + +set -euo pipefail + +KSFT_SKIP=3D4 +SCRIPT_DIR=3D"$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# kselftest auto-detect: when invoked with no arguments (e.g. by +# "make run_tests"), find a CephFS mount automatically or skip. 
+if [[ $# -eq 0 ]]; then + MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1 || true)" + if [[ -z "$MOUNT_POINT" ]]; then + echo "SKIP: No CephFS mount found and --mount-point not specified" + exit "$KSFT_SKIP" + fi + exec "$0" --mount-point "$MOUNT_POINT" +fi + +PROFILE=3D"moderate" +DURATION_SEC=3D"" +COOLDOWN_SEC=3D20 +FILE_COUNT=3D64 +IO_WORKERS=3D"" +RENAME_WORKERS=3D"" +MOUNT_POINT=3D"" +OUT_DIR=3D"" +CLIENT_ID=3D"" +DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph" +SLO_SECONDS=3D30 +EXPECT_RESET=3D1 +DMESG_CMD=3D"" +SUDO=3D"" + +RESET_MIN_SEC=3D5 +RESET_MAX_SEC=3D15 + +RUN_ID=3D"$(date +%Y%m%d-%H%M%S)" +WORKLOAD_FLAG=3D"" +RESET_FLAG=3D"" +DATA_DIR=3D"" + +IO_LOG=3D"" +RENAME_LOG=3D"" +RESET_LOG=3D"" +STATUS_LOG=3D"" +STATUS_BEFORE=3D"" +STATUS_FINAL=3D"" +DMESG_LOG=3D"" +SUMMARY_LOG=3D"" +REPORT_JSON=3D"" + +RESET_PID=3D0 +STATUS_PID=3D0 +declare -a IO_WORKER_PIDS=3D() +declare -a RENAME_WORKER_PIDS=3D() + +usage() +{ + cat <<EOF +Usage: $0 --mount-point PATH [options] + +Required: + --mount-point PATH CephFS mount point to test under + +Options: + --profile NAME baseline|moderate|aggressive|soak (default: mod= erate) + --duration-sec N Override profile runtime in seconds + --cooldown-sec N Workload drain time after injector stop (defaul= t: 20) + --file-count N Number of logical files (default: 64) + --io-workers N Number of concurrent I/O workers (profile defau= lt) + --rename-workers N Number of concurrent rename workers (profile de= fault) + --out-dir PATH Artifact directory (default: /tmp/ceph_reset_st= ress_) + --client-id ID Ceph debugfs client id; auto-detect if one clie= nt exists + --debugfs-root PATH Debugfs Ceph root (default: /sys/kernel/debug/c= eph) + --slo-seconds N Max allowed post-reset stall window (default: 3= 0) + --no-reset Disable reset injector (baseline mode helper) + --help Show this message + +Examples: + $0 --mount-point /mnt/cephfs --profile moderate + $0 --mount-point /mnt/cephfs --profile aggressive --duration-sec 300 + $0 --mount-point /mnt/cephfs --profile 
baseline --no-reset +EOF +} + +now_ms() +{ + date +%s%3N +} + +set_profile_defaults() +{ + case "$PROFILE" in + baseline) + RESET_MIN_SEC=3D0 + RESET_MAX_SEC=3D0 + EXPECT_RESET=3D0 + : "${DURATION_SEC:=3D600}" + : "${IO_WORKERS:=3D1}" + : "${RENAME_WORKERS:=3D1}" + ;; + moderate) + RESET_MIN_SEC=3D5 + RESET_MAX_SEC=3D15 + : "${DURATION_SEC:=3D900}" + : "${IO_WORKERS:=3D2}" + : "${RENAME_WORKERS:=3D1}" + ;; + aggressive) + RESET_MIN_SEC=3D1 + RESET_MAX_SEC=3D5 + : "${DURATION_SEC:=3D900}" + : "${IO_WORKERS:=3D4}" + : "${RENAME_WORKERS:=3D2}" + ;; + soak) + RESET_MIN_SEC=3D5 + RESET_MAX_SEC=3D15 + : "${DURATION_SEC:=3D3600}" + : "${IO_WORKERS:=3D2}" + : "${RENAME_WORKERS:=3D1}" + ;; + *) + echo "Unknown profile: $PROFILE" >&2 + exit 2 + ;; + esac +} + +log_summary() +{ + local msg=3D"$1" + printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$msg" | tee -a "$SUM= MARY_LOG" +} + +discover_client_id() +{ + local candidates=3D() + local entry + + if [[ -n "$CLIENT_ID" ]]; then + if ! $SUDO test -d "$DEBUGFS_ROOT/$CLIENT_ID/reset"; then + echo "SKIP: reset debugfs not found for client-id=3D$CLIENT_ID" >&2 + exit "$KSFT_SKIP" + fi + return 0 + fi + + if ! $SUDO test -d "$DEBUGFS_ROOT"; then + echo "SKIP: Debugfs root not found: $DEBUGFS_ROOT" >&2 + exit "$KSFT_SKIP" + fi + + while IFS=3D read -r entry; do + $SUDO test -d "$DEBUGFS_ROOT/$entry/reset" || continue + $SUDO test -w "$DEBUGFS_ROOT/$entry/reset/trigger" || continue + candidates+=3D("$entry") + done < <($SUDO ls -1 "$DEBUGFS_ROOT" 2>/dev/null || true) + + if [[ ${#candidates[@]} -eq 1 ]]; then + CLIENT_ID=3D"${candidates[0]}" + return 0 + fi + + if [[ ${#candidates[@]} -eq 0 ]]; then + echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" = >&2 + exit "$KSFT_SKIP" + fi + + echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-= id." 
>&2 + exit "$KSFT_SKIP" +} + +init_dataset() +{ + local i + mkdir -p "$DATA_DIR/A" "$DATA_DIR/B" + + for ((i =3D 0; i < FILE_COUNT; i++)); do + printf 'seed logical_id=3D%05d ts_ms=3D%s\n' "$i" "$(now_ms)" > "$DATA_D= IR/A/file_$(printf '%05d' "$i")" + done +} + +io_worker() +{ + set +e + local worker_id=3D"$1" + local seq=3D0 + local id + local relpath + local abspath + local payload + local hash + local ts + + while [[ -f "$WORKLOAD_FLAG" ]]; do + id=3D"$(printf '%05d' $((RANDOM % FILE_COUNT)))" + if [[ -f "$DATA_DIR/A/file_$id" ]]; then + relpath=3D"A/file_$id" + elif [[ -f "$DATA_DIR/B/file_$id" ]]; then + relpath=3D"B/file_$id" + else + sleep 0.02 + continue + fi + + abspath=3D"$DATA_DIR/$relpath" + alt_relpath=3D"" + if [[ "$relpath" =3D=3D A/* ]]; then + alt_relpath=3D"B/file_$id" + else + alt_relpath=3D"A/file_$id" + fi + alt_abspath=3D"$DATA_DIR/$alt_relpath" + payload=3D"worker=3D${worker_id} io_seq=3D${seq} id=3D${id} ts_ms=3D$(no= w_ms)" + result=3D"$( + python3 - "$abspath" "$alt_abspath" "$payload" <<'PY' +import hashlib +import os +import sys + +path =3D sys.argv[1] +alt_path =3D sys.argv[2] +payload =3D sys.argv[3] + +try: + fd =3D os.open(path, os.O_RDWR | os.O_APPEND) + actual =3D path +except FileNotFoundError: + try: + fd =3D os.open(alt_path, os.O_RDWR | os.O_APPEND) + actual =3D alt_path + except FileNotFoundError: + sys.exit(1) + +try: + os.write(fd, (payload + "\n").encode()) + os.fsync(fd) + os.lseek(fd, 0, os.SEEK_SET) + digest =3D hashlib.sha256() + while True: + chunk =3D os.read(fd, 1 << 20) + if not chunk: + break + digest.update(chunk) + print(actual + " " + digest.hexdigest()) +finally: + os.close(fd) +PY + )" || { + sleep 0.02 + continue + } + + actual_abspath=3D"${result%% *}" + hash=3D"${result#* }" + if [[ "$actual_abspath" =3D=3D "$alt_abspath" ]]; then + relpath=3D"$alt_relpath" + fi + + ts=3D"$(now_ms)" + printf '%s,%s,%s,%s,%s\n' "$ts" "$seq" "$id" "$relpath" "$hash" >> "$IO_= LOG" + seq=3D$((seq + 1)) + sleep 0.02 + done +} 
+ +rename_worker() +{ + set +e + local worker_id=3D"$1" + local seq=3D0 + local id + local src_rel + local dst_rel + local rc + local ts + + while [[ -f "$WORKLOAD_FLAG" ]]; do + id=3D"$(printf '%05d' $((RANDOM % FILE_COUNT)))" + + if [[ -f "$DATA_DIR/A/file_$id" ]]; then + src_rel=3D"A/file_$id" + dst_rel=3D"B/file_$id" + elif [[ -f "$DATA_DIR/B/file_$id" ]]; then + src_rel=3D"B/file_$id" + dst_rel=3D"A/file_$id" + else + sleep 0.02 + continue + fi + + ts=3D"$(now_ms)" + if mv -T "$DATA_DIR/$src_rel" "$DATA_DIR/$dst_rel" 2>/dev/null; then + rc=3D0 + else + rc=3D$? + fi + printf '%s,%s,%s,%s,%s,%s,%s\n' "$ts" "$worker_id" "$seq" "$id" "$src_re= l" "$dst_rel" "$rc" >> "$RENAME_LOG" + seq=3D$((seq + 1)) + sleep 0.02 + done +} + +random_sleep_seconds() +{ + local min_sec=3D"$1" + local max_sec=3D"$2" + local wait_sec + local span + + span=3D$((max_sec - min_sec + 1)) + wait_sec=3D$((min_sec + RANDOM % span)) + sleep "$wait_sec" +} + +reset_injector() +{ + set +e + local trigger_path=3D"$1" + local seq=3D0 + local ts + local reason + local rc + + while [[ -f "$RESET_FLAG" ]]; do + random_sleep_seconds "$RESET_MIN_SEC" "$RESET_MAX_SEC" + [[ -f "$RESET_FLAG" ]] || break + + ts=3D"$(now_ms)" + reason=3D"stress_${seq}_${ts}" + if echo "$reason" | $SUDO tee "$trigger_path" > /dev/null 2>&1; then + rc=3D0 + else + rc=3D$? 
+ fi + printf '%s,%s,%s,%s\n' "$ts" "$seq" "$reason" "$rc" >> "$RESET_LOG" + seq=3D$((seq + 1)) + done +} + +status_sampler() +{ + set +e + local status_path=3D"$1" + local ts + local kv_line + + while [[ -f "$WORKLOAD_FLAG" || -f "$RESET_FLAG" ]]; do + ts=3D"$(now_ms)" + if $SUDO test -r "$status_path"; then + kv_line=3D"$($SUDO awk -F': ' 'NF>=3D2 {gsub(/ /, "", $1); gsub(/ /, ""= , $2); printf "%s=3D%s;", $1, $2}' "$status_path")" + printf '%s,%s\n' "$ts" "$kv_line" >> "$STATUS_LOG" + fi + sleep 1 + done +} + +stop_pid_with_timeout() +{ + local pid=3D"$1" + local name=3D"$2" + local timeout=3D"$3" + local waited=3D0 + + if [[ "$pid" -le 0 ]]; then + return 0 + fi + + while kill -0 "$pid" 2>/dev/null; do + if (( waited >=3D timeout )); then + log_summary "Timeout waiting for $name (pid=3D$pid), sending SIGTERM/SI= GKILL" + kill -TERM "$pid" 2>/dev/null || true + sleep 1 + kill -KILL "$pid" 2>/dev/null || true + wait "$pid" 2>/dev/null || true + return 1 + fi + sleep 1 + waited=3D$((waited + 1)) + done + + wait "$pid" 2>/dev/null || true + return 0 +} + +detect_privileges() +{ + if [[ -r "$DEBUGFS_ROOT" ]]; then + SUDO=3D"" + elif sudo -n true 2>/dev/null; then + SUDO=3D"sudo" + else + echo "WARNING: $DEBUGFS_ROOT is not readable and passwordless sudo is no= t available" >&2 + echo "WARNING: reset injection, debugfs status checks, and dmesg capture= will not work" >&2 + fi + + if $SUDO dmesg > /dev/null 2>&1; then + DMESG_CMD=3D"$SUDO dmesg" + else + DMESG_CMD=3D"" + echo "WARNING: dmesg is not accessible; kernel errors (hung tasks) will = not be detected" >&2 + fi +} + +check_dmesg() +{ + local start_epoch=3D"$1" + + if [[ -z "$DMESG_CMD" ]]; then + return 0 + fi + + if ! $DMESG_CMD --since "@$start_epoch" > "$DMESG_LOG" 2>/dev/null; then + if ! 
$DMESG_CMD > "$DMESG_LOG" 2>/dev/null; then + log_summary "WARNING: dmesg capture failed unexpectedly" + return 0 + fi + log_summary "dmesg --since unsupported; captured full dmesg" + fi + + if grep -qi "hung task" "$DMESG_LOG" 2>/dev/null; then + log_summary "ERROR: kernel log contains 'hung task' during test window" + return 1 + fi + + return 0 +} + +cleanup() +{ + rm -f "$WORKLOAD_FLAG" "$RESET_FLAG" + local pid + for pid in "${IO_WORKER_PIDS[@]}" "${RENAME_WORKER_PIDS[@]}" "$RESET_PID"= "$STATUS_PID"; do + [[ "$pid" -gt 0 ]] 2>/dev/null && kill "$pid" 2>/dev/null || true + done + wait 2>/dev/null || true +} + +parse_args() +{ + while [[ $# -gt 0 ]]; do + case "$1" in + --mount-point) + MOUNT_POINT=3D"$2" + shift 2 + ;; + --profile) + PROFILE=3D"$2" + shift 2 + ;; + --duration-sec) + DURATION_SEC=3D"$2" + shift 2 + ;; + --cooldown-sec) + COOLDOWN_SEC=3D"$2" + shift 2 + ;; + --file-count) + FILE_COUNT=3D"$2" + shift 2 + ;; + --io-workers) + IO_WORKERS=3D"$2" + shift 2 + ;; + --rename-workers) + RENAME_WORKERS=3D"$2" + shift 2 + ;; + --out-dir) + OUT_DIR=3D"$2" + shift 2 + ;; + --client-id) + CLIENT_ID=3D"$2" + shift 2 + ;; + --debugfs-root) + DEBUGFS_ROOT=3D"$2" + shift 2 + ;; + --slo-seconds) + SLO_SECONDS=3D"$2" + shift 2 + ;; + --no-reset) + EXPECT_RESET=3D0 + shift + ;; + --help|-h) + usage + exit 0 + ;; + *) + echo "Unknown option: $1" >&2 + usage + exit 2 + ;; + esac + done +} + +main() +{ + local start_epoch + local trigger_path=3D"" + local status_path=3D"" + local final_rc=3D0 + local reset_enabled=3D0 + local i + + parse_args "$@" + + if [[ -z "$MOUNT_POINT" ]]; then + echo "--mount-point is required" >&2 + usage + exit 2 + fi + + if [[ ! -d "$MOUNT_POINT" ]]; then + echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2 + exit "$KSFT_SKIP" + fi + + if ! 
touch "$MOUNT_POINT/.ceph_reset_test_probe" 2>/dev/null; then + echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2 + exit "$KSFT_SKIP" + fi + rm -f "$MOUNT_POINT/.ceph_reset_test_probe" + + if ! command -v python3 > /dev/null 2>&1; then + echo "SKIP: python3 is required but not found in PATH" >&2 + exit "$KSFT_SKIP" + fi + + if ! stat -f -c '%T' "$MOUNT_POINT" 2>/dev/null | grep -qi ceph; then + echo "WARNING: $MOUNT_POINT does not appear to be a CephFS mount" >&2 + fi + + detect_privileges + + set_profile_defaults + if [[ "$EXPECT_RESET" -eq 0 ]]; then + PROFILE=3D"baseline" + RESET_MIN_SEC=3D0 + RESET_MAX_SEC=3D0 + fi + + if ! [[ "$IO_WORKERS" =3D~ ^[0-9]+$ && "$RENAME_WORKERS" =3D~ ^[0-9]+$ ]]= ; then + echo "io-workers and rename-workers must be integers" >&2 + exit 2 + fi + + if [[ "$IO_WORKERS" -le 0 || "$RENAME_WORKERS" -le 0 ]]; then + echo "io-workers and rename-workers must be > 0" >&2 + exit 2 + fi + + if [[ -z "$OUT_DIR" ]]; then + OUT_DIR=3D"/tmp/ceph_reset_stress_${RUN_ID}" + fi + mkdir -p "$OUT_DIR" + + WORKLOAD_FLAG=3D"$OUT_DIR/workload.running" + RESET_FLAG=3D"$OUT_DIR/reset.running" + + DATA_DIR=3D"$MOUNT_POINT/ceph_reset_stress_${RUN_ID}" + mkdir -p "$DATA_DIR" + + IO_LOG=3D"$OUT_DIR/io.log" + RENAME_LOG=3D"$OUT_DIR/rename.log" + RESET_LOG=3D"$OUT_DIR/reset.log" + STATUS_LOG=3D"$OUT_DIR/status.log" + STATUS_BEFORE=3D"$OUT_DIR/reset_status.before" + STATUS_FINAL=3D"$OUT_DIR/reset_status.final" + DMESG_LOG=3D"$OUT_DIR/dmesg.log" + SUMMARY_LOG=3D"$OUT_DIR/summary.log" + REPORT_JSON=3D"$OUT_DIR/validator_report.json" + + : > "$IO_LOG" + : > "$RENAME_LOG" + : > "$RESET_LOG" + : > "$STATUS_LOG" + : > "$SUMMARY_LOG" + + start_epoch=3D"$(date +%s)" + + log_summary "Starting Ceph reset stress test" + log_summary "Profile=3D$PROFILE duration=3D${DURATION_SEC}s cooldown=3D${= COOLDOWN_SEC}s file_count=3D${FILE_COUNT} io_workers=3D${IO_WORKERS} rename= _workers=3D${RENAME_WORKERS}" + [[ -n "$SUDO" ]] && log_summary "Using sudo for privileged 
operations" + [[ -z "$DMESG_CMD" ]] && log_summary "WARNING: dmesg not available; hung = task detection disabled" + log_summary "Artifacts=3D$OUT_DIR" + log_summary "Data dir=3D$DATA_DIR" + + init_dataset + + if [[ "$EXPECT_RESET" -eq 1 ]]; then + discover_client_id + trigger_path=3D"$DEBUGFS_ROOT/$CLIENT_ID/reset/trigger" + status_path=3D"$DEBUGFS_ROOT/$CLIENT_ID/reset/status" + if ! $SUDO test -w "$trigger_path"; then + echo "SKIP: Reset trigger is not writable: $trigger_path" >&2 + exit "$KSFT_SKIP" + fi + if ! $SUDO test -r "$status_path"; then + echo "SKIP: Reset status is not readable: $status_path" >&2 + exit "$KSFT_SKIP" + fi + $SUDO cat "$status_path" > "$STATUS_BEFORE" || true + reset_enabled=3D1 + log_summary "Using ceph client id: $CLIENT_ID" + fi + + trap cleanup EXIT INT TERM + + touch "$WORKLOAD_FLAG" + for ((i =3D 0; i < IO_WORKERS; i++)); do + io_worker "$i" & + IO_WORKER_PIDS+=3D("$!") + done + + for ((i =3D 0; i < RENAME_WORKERS; i++)); do + rename_worker "$i" & + RENAME_WORKER_PIDS+=3D("$!") + done + + if [[ "$reset_enabled" -eq 1 ]]; then + touch "$RESET_FLAG" + reset_injector "$trigger_path" & + RESET_PID=3D$! + + status_sampler "$status_path" & + STATUS_PID=3D$! + fi + + sleep "$DURATION_SEC" + + if [[ "$reset_enabled" -eq 1 ]]; then + rm -f "$RESET_FLAG" + stop_pid_with_timeout "$RESET_PID" "reset_injector" 20 || final_rc=3D1 + log_summary "Injector stopped; entering cooldown=3D${COOLDOWN_SEC}s" + fi + + sleep "$COOLDOWN_SEC" + + rm -f "$WORKLOAD_FLAG" + for i in "${!IO_WORKER_PIDS[@]}"; do + stop_pid_with_timeout "${IO_WORKER_PIDS[$i]}" "io_worker[$i]" 20 || fina= l_rc=3D1 + done + for i in "${!RENAME_WORKER_PIDS[@]}"; do + stop_pid_with_timeout "${RENAME_WORKER_PIDS[$i]}" "rename_worker[$i]" 20= || final_rc=3D1 + done + + if [[ "$reset_enabled" -eq 1 ]]; then + stop_pid_with_timeout "$STATUS_PID" "status_sampler" 10 || final_rc=3D1 + $SUDO cat "$status_path" > "$STATUS_FINAL" || true + fi + + if ! 
check_dmesg "$start_epoch"; then + final_rc=3D1 + fi + + if ! python3 "$SCRIPT_DIR/validate_consistency.py" \ + --data-dir "$DATA_DIR" \ + --file-count "$FILE_COUNT" \ + --io-log "$IO_LOG" \ + --rename-log "$RENAME_LOG" \ + --reset-log "$RESET_LOG" \ + --status-final "$STATUS_FINAL" \ + --slo-seconds "$SLO_SECONDS" \ + --report-json "$REPORT_JSON" \ + $( [[ "$reset_enabled" -eq 1 ]] && echo "--expect-reset" ); then + final_rc=3D1 + fi + + if [[ "$final_rc" -eq 0 ]]; then + log_summary "PASS: stress run completed successfully" + else + log_summary "FAIL: stress run detected one or more failures" + fi + + log_summary "Artifacts available in: $OUT_DIR" + exit "$final_rc" +} + +main "$@" --=20 2.34.1
From nobody Tue Jun 16 19:34:27 2026
From: Alex Markuze To: ceph-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze Subject: [PATCH v3 09/11] selftests: ceph: add reset corner-case tests Date: Wed, 29 Apr 2026 12:52:04 +0000 Message-Id: <20260429125206.1512203-10-amarkuze@redhat.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com> References: <20260429125206.1512203-1-amarkuze@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add targeted corner-case tests for the CephFS manual session reset feature. Four sequential tests cover: [1/4] ebusy_rejection - second reset rejected while first in-flight [2/4] dirty_caps_at_reset - reset with unflushed dirty caps [3/4] flock_after_reset - stale lock EIO + fresh lock after holder ex= it [4/4] unmount_during_reset - umount during active reset (destroy-path wa= keup) Requires: mounted CephFS, debugfs access (root), flock(1) utility.
Signed-off-by: Alex Markuze --- .../filesystems/ceph/reset_corner_cases.sh | 646 ++++++++++++++++++ 1 file changed, 646 insertions(+) create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_c= ases.sh diff --git a/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh= b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh new file mode 100755 index 000000000000..a6dae84a616d --- /dev/null +++ b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh @@ -0,0 +1,646 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# CephFS client reset corner case tests. +# Runs a checklist of targeted tests that exercise specific reset +# code paths not covered by the stress tests. +# +# Requires: mounted CephFS, debugfs access (root), flock(1) utility. + +set -uo pipefail + +KSFT_SKIP=3D4 + +# kselftest auto-detect: when invoked with no arguments (e.g. by +# "make run_tests"), find a CephFS mount automatically or skip. +if [[ $# -eq 0 ]]; then + MOUNT_POINT=3D"$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)" + if [[ -z "$MOUNT_POINT" ]]; then + echo "SKIP: No CephFS mount found and --mount-point not specified" + exit "$KSFT_SKIP" + fi + exec "$0" --mount-point "$MOUNT_POINT" +fi + +MOUNT_POINT=3D"" +DEBUGFS_ROOT=3D"/sys/kernel/debug/ceph" +DEBUGFS_CLIENT=3D"" +TRIGGER_PATH=3D"" +STATUS_PATH=3D"" +TEMP_MNT=3D"" + +PASS_COUNT=3D0 +FAIL_COUNT=3D0 +SKIP_COUNT=3D0 +TOTAL=3D4 + +log() +{ + printf '[%s] %s\n' "$(date -u +%H:%M:%S)" "$1" +} + +result() +{ + local num=3D"$1" + local name=3D"$2" + local status=3D"$3" + local detail=3D"${4:-}" + + case "$status" in + PASS) PASS_COUNT=3D$((PASS_COUNT + 1)) ;; + FAIL) FAIL_COUNT=3D$((FAIL_COUNT + 1)) ;; + SKIP) SKIP_COUNT=3D$((SKIP_COUNT + 1)) ;; + esac + + if [[ -n "$detail" ]]; then + printf '[%d/%d] %-30s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$de= tail" + else + printf '[%d/%d] %-30s %s\n' "$num" "$TOTAL" "$name" "$status" + fi +} + +read_status_field() +{ + local field=3D"$1" + 
awk -F': ' -v key=3D"$field" '$1 =3D=3D key {print $2}' "$STATUS_PATH" 2>= /dev/null +} + +wait_reset_done() +{ + local timeout=3D"${1:-30}" + local elapsed=3D0 + + while [[ "$(read_status_field "phase")" !=3D "idle" ]]; do + sleep 1 + elapsed=3D$((elapsed + 1)) + if [[ "$elapsed" -ge "$timeout" ]]; then + return 1 + fi + done + return 0 +} + +list_reset_clients() +{ + local entry + + for entry in "$DEBUGFS_ROOT"/*/; do + entry=3D"$(basename "$entry")" + [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue + [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue + printf '%s\n' "$entry" + done +} + +wait_status_nonidle() +{ + local status_path=3D"$1" + local timeout=3D"${2:-10}" + local polls=3D$((timeout * 10)) + local phase + + while [[ "$polls" -gt 0 ]]; do + phase=3D"$(awk -F': ' '$1 =3D=3D "phase" {print $2}' "$status_path" 2>/d= ev/null)" + if [[ -n "$phase" && "$phase" !=3D "idle" ]]; then + return 0 + fi + sleep 0.1 + polls=3D$((polls - 1)) + done + + return 1 +} + +discover_debugfs() +{ + local candidates=3D() + local entry + + if [[ -n "$DEBUGFS_CLIENT" ]]; then + if [[ ! -d "$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset" ]]; then + echo "SKIP: reset debugfs not found for $DEBUGFS_CLIENT" >&2 + exit "$KSFT_SKIP" + fi + return 0 + fi + + for entry in "$DEBUGFS_ROOT"/*/; do + entry=3D"$(basename "$entry")" + [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue + [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue + candidates+=3D("$entry") + done + + if [[ ${#candidates[@]} -eq 0 ]]; then + echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" = >&2 + exit "$KSFT_SKIP" + fi + + if [[ ${#candidates[@]} -gt 1 ]]; then + echo "SKIP: Multiple Ceph clients found: ${candidates[*]}. Use --client-= id." >&2 + exit "$KSFT_SKIP" + fi + + DEBUGFS_CLIENT=3D"${candidates[0]}" +} + +# --- Test 1: ebusy_rejection --------------------------------------------= ---- +# +# Trigger a reset while another is guaranteed in-flight. 
Creates +# dirty state so the first reset enters DRAINING (which takes +# measurable time), then polls until phase !=3D idle and issues the +# second trigger. The second trigger must fail (the kernel returns +# -EBUSY), and only one reset must be counted in the accounting. + +test_ebusy_rejection() +{ + local num=3D1 + local name=3D"ebusy_rejection" + local testfile=3D"$MOUNT_POINT/.reset_corner_ebusy_$$" + local tc_before tc_after sc_before sc_after second_rc phase elapsed + + tc_before=3D"$(read_status_field "trigger_count")" + sc_before=3D"$(read_status_field "success_count")" + + # Create dirty state so the first reset enters DRAINING + echo "ebusy_dirty_data" > "$testfile" + sync "$testfile" + + python3 -c " +import os, sys +fd =3D os.open('$testfile', os.O_WRONLY | os.O_APPEND) +os.write(fd, b'dirty_for_ebusy_test\n') +sys.stdout.write('written') +" 2>/dev/null || { + result "$num" "$name" FAIL "dirty write failed" + rm -f "$testfile" + return + } + + # Trigger the first reset -- it will drain dirty state + echo "ebusy_first" > "$TRIGGER_PATH" 2>/dev/null || { + result "$num" "$name" FAIL "first trigger failed" + rm -f "$testfile" + return + } + + # Poll until phase is non-idle (quiescing or draining) + elapsed=3D0 + while true; do + phase=3D"$(read_status_field "phase")" + if [[ "$phase" !=3D "idle" ]]; then + break + fi + sleep 0.1 + elapsed=3D$((elapsed + 1)) + if [[ "$elapsed" -ge 50 ]]; then + result "$num" "$name" SKIP \ + "first reset completed before overlap could be tested" + rm -f "$testfile" 2>/dev/null + return + fi + done + + # Issue the second trigger -- should be rejected with EBUSY + second_rc=3D0 + echo "ebusy_second" > "$TRIGGER_PATH" 2>/dev/null && second_rc=3D0 || sec= ond_rc=3D$? + + if ! 
wait_reset_done 30; then + result "$num" "$name" FAIL "first reset never completed" + rm -f "$testfile" + return + fi + + tc_after=3D"$(read_status_field "trigger_count")" + sc_after=3D"$(read_status_field "success_count")" + + if [[ "$((tc_after - tc_before))" -ne 1 ]]; then + result "$num" "$name" FAIL "trigger_count +$((tc_after - tc_before)), ex= pected +1" + rm -f "$testfile" + return + fi + + if [[ "$((sc_after - sc_before))" -ne 1 ]]; then + result "$num" "$name" FAIL "success_count +$((sc_after - sc_before)), ex= pected +1" + rm -f "$testfile" + return + fi + + if [[ "$second_rc" -eq 0 ]]; then + result "$num" "$name" FAIL "second trigger did not return error" + rm -f "$testfile" + return + fi + + rm -f "$testfile" 2>/dev/null + result "$num" "$name" PASS +} + +# --- Test 2: dirty_caps_at_reset ----------------------------------------= ---- +# +# Write to a file without fsync (dirty caps), trigger reset, then +# verify the file is not corrupt. Manual reset drains dirty caps +# before teardown (best-effort, 5s timeout). For a non-stuck cap +# the dirty write should be flushed during drain and persist. +# If the drain window is too short, only the synced first line +# persists -- that is acceptable (data loss is documented for +# unflushed writes). 
+
+test_dirty_caps_at_reset()
+{
+	local num=2
+	local name="dirty_caps_at_reset"
+	local testfile="$MOUNT_POINT/.reset_corner_dirty_caps_$$"
+	local content_after line_count sc_before sc_after le
+
+	sc_before="$(read_status_field "success_count")"
+
+	echo "line_1_before_dirty_write" > "$testfile"
+	sync "$testfile"
+
+	python3 -c "
+import os, sys
+fd = os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'line_2_dirty_no_fsync\n')
+# deliberately no fsync -- leave caps dirty
+sys.stdout.write('written')
+" 2>/dev/null || {
+		result "$num" "$name" FAIL "dirty write failed"
+		rm -f "$testfile"
+		return
+	}
+
+	echo "dirty_caps_test" > "$TRIGGER_PATH" 2>/dev/null || {
+		result "$num" "$name" FAIL "reset trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	if ! wait_reset_done 30; then
+		result "$num" "$name" FAIL "reset did not complete"
+		rm -f "$testfile"
+		return
+	fi
+
+	sc_after="$(read_status_field "success_count")"
+	if [[ "$sc_after" -le "$sc_before" ]]; then
+		result "$num" "$name" FAIL "success_count did not increment (reset not exercised)"
+		rm -f "$testfile"
+		return
+	fi
+
+	sync "$testfile" 2>/dev/null || true
+	content_after="$(cat "$testfile" 2>/dev/null)" || {
+		result "$num" "$name" FAIL "cannot read file after reset"
+		rm -f "$testfile"
+		return
+	}
+
+	if [[ -z "$content_after" ]]; then
+		result "$num" "$name" FAIL "file is empty after reset"
+		rm -f "$testfile"
+		return
+	fi
+
+	line_count="$(echo "$content_after" | wc -l)"
+	if [[ "$line_count" -lt 1 ]]; then
+		result "$num" "$name" FAIL "file has $line_count lines, expected >= 1"
+		rm -f "$testfile"
+		return
+	fi
+
+	echo "$content_after" | head -1 | grep -q "line_1_before_dirty_write" || {
+		result "$num" "$name" FAIL "first line corrupted"
+		rm -f "$testfile"
+		return
+	}
+
+	le="$(read_status_field "last_errno")"
+	if [[ "$le" != "0" ]]; then
+		result "$num" "$name" FAIL "last_errno=$le, expected 0"
+		rm -f "$testfile"
+		return
+	fi
+
+	rm -f "$testfile"
+	result "$num" "$name" PASS "file intact ($line_count lines)"
+}
+
+# --- Test 3: flock_after_reset ----------------------------------------------
+#
+# Take an exclusive flock, trigger reset, verify stale lock state is
+# marked with CEPH_I_ERROR_FILELOCK (same-client flock attempt returns
+# EIO). After the original holder exits (releasing the local lock
+# reference and clearing the error flag), a fresh lock can be acquired.
+#
+# The lock holder uses the fd-based flock form with exec, so killing
+# $lock_pid closes the lock fd immediately (no orphaned child with an
+# inherited fd copy that would prevent the VFS flock release).
+
+test_flock_after_reset()
+{
+	local num=3
+	local name="flock_after_reset"
+	local testfile="$MOUNT_POINT/.reset_corner_flock_$$"
+	local lock_pid probe_rc sc_before sc_after
+
+	sc_before="$(read_status_field "success_count")"
+
+	echo "flock_test_content" > "$testfile"
+	sync "$testfile"
+
+	# Hold lock via fd in a subshell; exec ensures killing $lock_pid
+	# closes the lock fd directly (no fork/child fd inheritance).
+	(
+		exec 9<"$testfile"
+		flock --exclusive --nonblock 9 || exit 1
+		exec sleep 300
+	) &
+	lock_pid=$!
+	sleep 0.5
+
+	if ! kill -0 "$lock_pid" 2>/dev/null; then
+		result "$num" "$name" FAIL "flock holder died immediately"
+		rm -f "$testfile"
+		return
+	fi
+
+	echo "flock_after_reset_test" > "$TRIGGER_PATH" 2>/dev/null || {
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "reset trigger failed"
+		rm -f "$testfile"
+		return
+	}
+
+	if ! wait_reset_done 30; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "reset did not complete"
+		rm -f "$testfile"
+		return
+	fi
+
+	sc_after="$(read_status_field "success_count")"
+	if [[ "$sc_after" -le "$sc_before" ]]; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL "success_count did not increment"
+		rm -f "$testfile"
+		return
+	fi
+
+	# After teardown, CEPH_I_ERROR_FILELOCK is set on the inode.
+	# A same-client lock attempt should fail (EIO), NOT succeed.
+	probe_rc=0
+	flock --exclusive --nonblock "$testfile" true 2>/dev/null && probe_rc=0 || probe_rc=$?
+	if [[ "$probe_rc" -eq 0 ]]; then
+		kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+		result "$num" "$name" FAIL \
+			"same-client probe succeeded, expected EIO from stale lock state"
+		rm -f "$testfile"
+		return
+	fi
+
+	# Kill the holder -- the exec'd sleep IS $lock_pid, so killing it
+	# closes fd 9 directly. VFS flock release fires ceph_fl_release_lock(),
+	# which decrements i_filelock_ref to 0 and clears CEPH_I_ERROR_FILELOCK.
+	kill "$lock_pid" 2>/dev/null
+	wait "$lock_pid" 2>/dev/null
+
+	# After the holder exits, a fresh lock should be acquirable.
+	# The reset teardown sends SESSION_REQUEST_CLOSE so the MDS
+	# releases locks promptly, but retry briefly in case the
+	# message races with the connection close.
+	local attempt
+	probe_rc=1
+	for attempt in 1 2 3 4 5; do
+		probe_rc=0
+		flock --exclusive --nonblock "$testfile" true 2>/dev/null \
+			&& probe_rc=0 || probe_rc=$?
+		[[ "$probe_rc" -eq 0 ]] && break
+		sleep 1
+	done
+	if [[ "$probe_rc" -ne 0 ]]; then
+		result "$num" "$name" FAIL \
+			"cannot acquire fresh lock after holder exit (rc=$probe_rc, ${attempt} attempts)"
+		rm -f "$testfile"
+		return
+	fi
+
+	# Verify file content survived
+	grep -q "flock_test_content" "$testfile" 2>/dev/null || {
+		result "$num" "$name" FAIL "file content corrupted after reset"
+		rm -f "$testfile"
+		return
+	}
+
+	rm -f "$testfile"
+	result "$num" "$name" PASS "stale lock detected, fresh lock acquired after holder exit"
+}
+
+# --- Test 4: unmount_during_reset -------------------------------------------
+#
+# Mount a fresh CephFS, trigger reset, immediately unmount. The
+# ceph_mdsc_destroy() path must wake blocked waiters with -ESHUTDOWN
+# and not hang.
+
+test_unmount_during_reset()
+{
+	local num=4
+	local name="unmount_during_reset"
+	local temp_mnt="/tmp/ceph_corner_mnt_$$"
+	local mount_opts=""
+	local mount_src=""
+	local temp_trigger=""
+	local temp_status=""
+	local temp_client=""
+	local temp_file="$temp_mnt/.reset_corner_umount_$$"
+	local phase=""
+	local trigger_ok=0
+	local attempt
+	local -a new_clients=()
+	declare -A existing_clients=()
+
+	mount_src="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $1; exit}' /proc/mounts 2>/dev/null)"
+	mount_opts="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $4; exit}' /proc/mounts 2>/dev/null)"
+
+	if [[ -z "$mount_src" ]]; then
+		result "$num" "$name" SKIP "cannot determine mount source from /proc/mounts"
+		return
+	fi
+
+	while IFS= read -r existing; do
+		[[ -n "$existing" ]] || continue
+		existing_clients["$existing"]=1
+	done < <(list_reset_clients)
+
+	mkdir -p "$temp_mnt"
+
+	if ! mount -t ceph "$mount_src" "$temp_mnt" -o "$mount_opts" 2>/dev/null; then
+		result "$num" "$name" SKIP "cannot mount additional CephFS instance"
+		rmdir "$temp_mnt" 2>/dev/null
+		return
+	fi
+
+	ls "$temp_mnt" > /dev/null 2>&1
+	sync
+	sleep 1
+
+	for attempt in $(seq 1 50); do
+		new_clients=()
+		while IFS= read -r entry; do
+			[[ -n "$entry" ]] || continue
+			if [[ -n "${existing_clients[$entry]+x}" ]]; then
+				continue
+			fi
+			new_clients+=("$entry")
+		done < <(list_reset_clients)
+
+		if [[ "${#new_clients[@]}" -eq 1 ]]; then
+			temp_client="${new_clients[0]}"
+			break
+		fi
+
+		if [[ "${#new_clients[@]}" -gt 1 ]]; then
+			break
+		fi
+
+		sleep 0.1
+	done
+
+	if [[ -z "$temp_client" ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" SKIP "cannot identify debugfs client for temp mount"
+		return
+	fi
+
+	if [[ "${#new_clients[@]}" -gt 1 ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" SKIP "multiple new debugfs clients appeared"
+		return
+	fi
+
+	temp_trigger="$DEBUGFS_ROOT/$temp_client/reset/trigger"
+	temp_status="$DEBUGFS_ROOT/$temp_client/reset/status"
+
+	echo "umount_dirty_seed" > "$temp_file" 2>/dev/null || {
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot create dirty state on temp mount"
+		return
+	}
+	sync "$temp_file"
+	python3 -c "
+import os, sys
+fd = os.open('$temp_file', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_umount_test\\n')
+os.close(fd)
+" 2>/dev/null || {
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot dirty temp mount for reset overlap"
+		return
+	}
+
+	echo "unmount_test" > "$temp_trigger" 2>/dev/null && trigger_ok=1 || trigger_ok=0
+	if [[ "$trigger_ok" -ne 1 ]]; then
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "cannot trigger reset on temp mount"
+		return
+	fi
+
+	if ! wait_status_nonidle "$temp_status" 10; then
+		phase="$(awk -F': ' '$1 == "phase" {print $2}' "$temp_status" 2>/dev/null)"
+		umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL \
+			"reset never became active before umount (phase=${phase:-unknown})"
+		return
+	fi
+
+	local umount_ok=0
+	timeout 30 umount "$temp_mnt" 2>/dev/null && umount_ok=1
+
+	if [[ "$umount_ok" -ne 1 ]]; then
+		umount -l "$temp_mnt" 2>/dev/null || true
+		rmdir "$temp_mnt" 2>/dev/null
+		result "$num" "$name" FAIL "umount hung for >30s"
+		return
+	fi
+
+	rmdir "$temp_mnt" 2>/dev/null
+
+	ls "$MOUNT_POINT" > /dev/null 2>&1 || {
+		result "$num" "$name" FAIL "original mount unhealthy after test"
+		return
+	}
+
+	result "$num" "$name" PASS
+}
+
+# --- Main --------------------------------------------------------------------
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point <path> [--client-id <id>] [--debugfs-root <path>]
+
+Runs targeted corner-case tests for the CephFS client reset feature.
+Requires root (debugfs access) and a mounted CephFS filesystem.
+
+Options:
+  --mount-point PATH    CephFS mount point (required)
+  --client-id ID        Ceph debugfs client id (auto-detect if one client)
+  --debugfs-root PATH   Debugfs ceph root (default: /sys/kernel/debug/ceph)
+  --help                Show this message
+EOF
+}
+
+main()
+{
+	while [[ $# -gt 0 ]]; do
+		case "$1" in
+		--mount-point) MOUNT_POINT="$2"; shift 2 ;;
+		--client-id) DEBUGFS_CLIENT="$2"; shift 2 ;;
+		--debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+		--help|-h) usage; exit 0 ;;
+		*) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+		esac
+	done
+
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "--mount-point is required" >&2
+		usage
+		exit 2
+	fi
+
+	if [[ ! -d "$MOUNT_POINT" ]]; then
+		echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+		exit "$KSFT_SKIP"
+	fi
+
+	discover_debugfs
+	TRIGGER_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/trigger"
+	STATUS_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/status"
+
+	log "CephFS client reset corner case tests"
+	log "Mount: $MOUNT_POINT"
+	log "Client: $DEBUGFS_CLIENT"
+	echo ""
+
+	test_ebusy_rejection
+	test_dirty_caps_at_reset
+	test_flock_after_reset
+	test_unmount_during_reset
+
+	echo ""
+	echo "Results: $PASS_COUNT passed, $FAIL_COUNT failed, $SKIP_COUNT skipped (of $TOTAL)"
+
+	if [[ "$FAIL_COUNT" -gt 0 ]]; then
+		exit 1
+	fi
+	exit 0
+}
+
+main "$@"
-- 
2.34.1

From nobody Tue Jun 16 19:34:27 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v3 10/11] selftests: ceph: add validation harness
Date: Wed, 29 Apr 2026 12:52:05 +0000
Message-Id: <20260429125206.1512203-11-amarkuze@redhat.com>
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>

Add a one-shot validation wrapper that orchestrates the full reset
test suite with per-stage watchdog timeouts and a final status check.

The harness runs five stages: baseline (no resets), corner cases,
moderate stress, aggressive stress, and a post-run status validation.
Each stage runs with an independent timeout so a hang in one stage
does not block the entire run.

Signed-off-by: Alex Markuze
---
 .../filesystems/ceph/run_validation.sh | 350 ++++++++++++++++++
 1 file changed, 350 insertions(+)
 create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh

diff --git a/tools/testing/selftests/filesystems/ceph/run_validation.sh b/tools/testing/selftests/filesystems/ceph/run_validation.sh
new file mode 100755
index 000000000000..5d521e4f9e9b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/run_validation.sh
@@ -0,0 +1,350 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset - single-command validation.
+# Runs all test stages in sequence with per-stage timeouts.
+# If any stage hangs (filesystem stuck, process blocked), the
+# timeout kills it and reports failure.
+#
+# Usage:
+#   sudo ./run_validation.sh --mount-point /mnt/mycephfs
+#
+# Expected output on success:
+#
+#   === CephFS Client Reset Validation ===
+#   [stage 1/5] baseline         PASS (60s, no resets)
+#   [stage 2/5] corner_cases     PASS (4/4 passed)
+#   [stage 3/5] moderate         PASS (120s, resets every 5-15s)
+#   [stage 4/5] aggressive       PASS (120s, resets every 1-5s)
+#   [stage 5/5] status_check     PASS (phase=idle, last_errno=0)
+#
+#   RESULT: 5/5 stages passed
+#   Artifacts: /tmp/ceph_reset_validation_
+
+set -uo pipefail
+
+KSFT_SKIP=4
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+	MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+	if [[ -z "$MOUNT_POINT" ]]; then
+		echo "SKIP: No CephFS mount found and --mount-point not specified"
+		exit "$KSFT_SKIP"
+	fi
+	exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=""
+CLIENT_ID=""
+declare -a CLIENT_ARGS=()
+declare -a DEBUGFS_ARGS=()
+RUN_ID="$(date +%Y%m%d-%H%M%S)"
+OUT_DIR="/tmp/ceph_reset_validation_${RUN_ID}"
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+
+# Timeout margins: stage runtime + cooldown + validation + safety buffer
+STAGE1_TIMEOUT=120	# 60s run + 20s cooldown + 40s buffer
+STAGE2_TIMEOUT=300	# 4 corner cases, 30s each worst case + buffer
+STAGE3_TIMEOUT=240	# 120s run + 20s cooldown + 100s buffer
+STAGE4_TIMEOUT=240	# 120s run + 20s cooldown + 100s buffer
+STAGE5_TIMEOUT=10	# just reading debugfs
+
+PASS=0
+FAIL=0
+TOTAL=5
+
+usage()
+{
+	cat <<EOF
+Usage: $0 --mount-point <path> [options]
+
+Required:
+  --mount-point PATH    CephFS mount point
+
+Options:
+  --out-dir PATH        Artifact directory (default: /tmp/ceph_reset_validation_)
+  --client-id ID        Ceph debugfs client id (optional)
+  --debugfs-root PATH   Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+  --help                Show this message
+EOF
+}
+
+stage_result()
+{
+	local num="$1"
+	local name="$2"
+	local status="$3"
+	local detail="$4"
+
+	if [[ "$status" == "PASS" ]]; then
+		PASS=$((PASS + 1))
+	else
+		FAIL=$((FAIL + 1))
+	fi
+	printf '[stage %d/%d] %-16s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$detail"
+}
+
+# Run a command with a timeout. Returns 0 on success, 1 on failure/timeout.
+# Sets RUN_TIMED_OUT=1 if killed by timeout.
+#
+# The stage command runs in its own session/process group (via setsid).
+# On timeout the entire process group is killed, not just the top-level
+# script PID. This is required because stage scripts (reset_stress.sh,
+# reset_corner_cases.sh) spawn child processes - I/O workers, rename
+# workers, reset injectors, samplers - that would otherwise survive the
+# timeout and bleed into later stages, invalidating results.
+RUN_TIMED_OUT=0
+
+run_with_timeout()
+{
+	local timeout_sec="$1"
+	local logfile="$2"
+	shift 2
+
+	RUN_TIMED_OUT=0
+
+	# Start the stage in its own session via setsid so all descendant
+	# processes share a process group that we can kill atomically.
+	# In a non-interactive script, background children are not process
+	# group leaders, so setsid(1) calls setsid(2) directly (no extra
+	# fork) and the PID we capture IS the group leader.
+	setsid "$@" > "$logfile" 2>&1 &
+	local pid=$!
+
+	# Watchdog: on timeout, kill the entire process group
+	(
+		sleep "$timeout_sec"
+		if kill -0 "$pid" 2>/dev/null; then
+			echo "TIMEOUT: stage exceeded ${timeout_sec}s, killing process group $pid" >> "$logfile"
+			kill -TERM -- -"$pid" 2>/dev/null
+			sleep 2
+			kill -KILL -- -"$pid" 2>/dev/null
+		fi
+	) &
+	local watchdog_pid=$!
+
+	# Wait for the stage command
+	wait "$pid" 2>/dev/null
+	local rc=$?
+
+	# Kill the watchdog if it's still running
+	kill "$watchdog_pid" 2>/dev/null
+	wait "$watchdog_pid" 2>/dev/null
+
+	# Check if it was killed by timeout
+	if grep -q "^TIMEOUT:" "$logfile" 2>/dev/null; then
+		RUN_TIMED_OUT=1
+		return 1
+	fi
+
+	return "$rc"
+}
+
+find_status_path()
+{
+	local entry
+
+	if [[ -n "$CLIENT_ID" ]]; then
+		if [[ -r "$DEBUGFS_ROOT/$CLIENT_ID/reset/status" ]]; then
+			echo "$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+			return 0
+		fi
+		return 1
+	fi
+
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		if [[ -r "${entry}reset/status" ]]; then
+			echo "${entry}reset/status"
+			return 0
+		fi
+	done
+	return 1
+}
+
+read_status_field()
+{
+	local status_path="$1"
+	local field="$2"
+	awk -F': ' -v key="$field" '$1 == key {print $2}' "$status_path" 2>/dev/null
+}
+
+# --- Parse arguments -------------------------------------------------------
+
+while [[ $# -gt 0 ]]; do
+	case "$1" in
+	--mount-point) MOUNT_POINT="$2"; shift 2 ;;
+	--out-dir) OUT_DIR="$2"; shift 2 ;;
+	--client-id) CLIENT_ID="$2"; shift 2 ;;
+	--debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+	--help|-h) usage; exit 0 ;;
+	*) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+	esac
+done
+
+if [[ -z "$MOUNT_POINT" ]]; then
+	echo "SKIP: --mount-point is required" >&2
+	usage
+	exit "$KSFT_SKIP"
+fi
+
+if [[ ! -d "$MOUNT_POINT" ]]; then
+	echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+	exit "$KSFT_SKIP"
+fi
+
+# Auto-detect client id when not specified, so all stages (including
+# stage 5 status check) use the same client consistently.
+if [[ -z "$CLIENT_ID" ]]; then
+	candidates=()
+	for entry in "$DEBUGFS_ROOT"/*/; do
+		name="$(basename "$entry")"
+		if [[ -r "${entry}reset/status" ]]; then
+			candidates+=("$name")
+		fi
+	done
+	if [[ ${#candidates[@]} -eq 1 ]]; then
+		CLIENT_ID="${candidates[0]}"
+	elif [[ ${#candidates[@]} -gt 1 ]]; then
+		echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-id." >&2
+		exit "$KSFT_SKIP"
+	fi
+fi
+
+if [[ -n "$CLIENT_ID" ]]; then
+	CLIENT_ARGS=(--client-id "$CLIENT_ID")
+fi
+DEBUGFS_ARGS=(--debugfs-root "$DEBUGFS_ROOT")
+
+# Quick sanity: can we write to the mount?
+if ! touch "$MOUNT_POINT/.validation_probe_$$" 2>/dev/null; then
+	echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+	exit "$KSFT_SKIP"
+fi
+rm -f "$MOUNT_POINT/.validation_probe_$$"
+
+mkdir -p "$OUT_DIR"
+
+echo ""
+echo "=== CephFS Client Reset Validation ==="
+echo ""
+
+# --- Stage 1: Baseline (no resets) -----------------------------------------
+
+stage1_out="$OUT_DIR/stage1_baseline"
+if run_with_timeout "$STAGE1_TIMEOUT" "$stage1_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+	--mount-point "$MOUNT_POINT" \
+	--profile baseline \
+	--no-reset \
+	--duration-sec 60 \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--out-dir "$stage1_out"; then
+	stage_result 1 "baseline" "PASS" "60s, no resets"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 1 "baseline" "FAIL" "HUNG: killed after ${STAGE1_TIMEOUT}s"
+else
+	stage_result 1 "baseline" "FAIL" "see $stage1_out.log"
+fi
+
+# --- Stage 2: Corner cases -------------------------------------------------
+
+stage2_out="$OUT_DIR/stage2_corner_cases"
+mkdir -p "$stage2_out"
+if run_with_timeout "$STAGE2_TIMEOUT" "$stage2_out/output.log" \
+	"$SCRIPT_DIR/reset_corner_cases.sh" \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--mount-point "$MOUNT_POINT"; then
+	pass_line=$(grep -Eo '[0-9]+ passed, [0-9]+ failed, [0-9]+ skipped' "$stage2_out/output.log" | tail -1)
+	stage_result 2 "corner_cases" "PASS" "${pass_line:-all tests passed}"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 2 "corner_cases" "FAIL" "HUNG: killed after ${STAGE2_TIMEOUT}s"
+else
+	fail_line=$(grep -c 'FAIL' "$stage2_out/output.log" 2>/dev/null || echo "?")
+	stage_result 2 "corner_cases" "FAIL" "${fail_line} failures, see $stage2_out/output.log"
+fi
+
+# --- Stage 3: Moderate resets ----------------------------------------------
+
+stage3_out="$OUT_DIR/stage3_moderate"
+if run_with_timeout "$STAGE3_TIMEOUT" "$stage3_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+	--mount-point "$MOUNT_POINT" \
+	--profile moderate \
+	--duration-sec 120 \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--out-dir "$stage3_out"; then
+	stage_result 3 "moderate" "PASS" "120s, resets every 5-15s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 3 "moderate" "FAIL" "HUNG: killed after ${STAGE3_TIMEOUT}s"
+else
+	stage_result 3 "moderate" "FAIL" "see $stage3_out.log"
+fi
+
+# --- Stage 4: Aggressive resets --------------------------------------------
+
+stage4_out="$OUT_DIR/stage4_aggressive"
+if run_with_timeout "$STAGE4_TIMEOUT" "$stage4_out.log" \
+	"$SCRIPT_DIR/reset_stress.sh" \
+	--mount-point "$MOUNT_POINT" \
+	--profile aggressive \
+	--duration-sec 120 \
+	"${CLIENT_ARGS[@]}" \
+	"${DEBUGFS_ARGS[@]}" \
+	--out-dir "$stage4_out"; then
+	stage_result 4 "aggressive" "PASS" "120s, resets every 1-5s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+	stage_result 4 "aggressive" "FAIL" "HUNG: killed after ${STAGE4_TIMEOUT}s"
+else
+	stage_result 4 "aggressive" "FAIL" "see $stage4_out.log"
+fi
+
+# --- Stage 5: Post-run status check ----------------------------------------
+
+status_path=""
+if status_path=$(find_status_path); then
+	phase=$(read_status_field "$status_path" "phase")
+	last_errno=$(read_status_field "$status_path" "last_errno")
+	failure_count=$(read_status_field "$status_path" "failure_count")
+	drain_timed_out=$(read_status_field "$status_path" "drain_timed_out")
+	sessions_reset=$(read_status_field "$status_path" "sessions_reset")
+	blocked=$(read_status_field "$status_path" "blocked_requests")
+
+	# Save full status
+	cat "$status_path" > "$OUT_DIR/final_status.txt" 2>/dev/null
+
+	errors=""
+	[[ "$phase" != "idle" ]] && errors="${errors}phase=$phase "
+	[[ "$last_errno" != "0" ]] && errors="${errors}last_errno=$last_errno "
+	[[ "$failure_count" != "0" && -n "$failure_count" ]] && errors="${errors}failure_count=$failure_count "
+	[[ "$blocked" != "0" ]] && errors="${errors}blocked_requests=$blocked "
+
+	if [[ -z "$errors" ]]; then
+		detail="phase=$phase, last_errno=$last_errno, failure_count=${failure_count:-0}"
+		[[ "$drain_timed_out" == "yes" ]] && detail="$detail, drain_timed_out=yes"
+		[[ -n "$sessions_reset" ]] && detail="$detail, sessions_reset=$sessions_reset"
+		stage_result 5 "status_check" "PASS" "$detail"
+	else
+		stage_result 5 "status_check" "FAIL" "$errors"
+	fi
+else
+	stage_result 5 "status_check" "FAIL" "cannot read reset/status"
+fi
+
+# --- Summary ---------------------------------------------------------------
+
+echo ""
+if [[ "$FAIL" -eq 0 ]]; then
+	echo "RESULT: $PASS/$TOTAL stages passed"
+else
+	echo "RESULT: $PASS/$TOTAL stages passed, $FAIL FAILED"
+fi
+echo "Artifacts: $OUT_DIR"
+echo ""
+
+exit $(( FAIL > 0 ))	# not "$FAIL": four failed stages would exit 4, which equals KSFT_SKIP
-- 
2.34.1

From nobody Tue Jun 16 19:34:27 2026
From: Alex Markuze
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com, vdubeyko@redhat.com, Alex Markuze
Subject: [PATCH v3 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation
Date: Wed, 29 Apr 2026 12:52:06 +0000
Message-Id: <20260429125206.1512203-12-amarkuze@redhat.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260429125206.1512203-1-amarkuze@redhat.com>
References: <20260429125206.1512203-1-amarkuze@redhat.com>

Wire the CephFS reset test suite into the kselftest build:

- Add filesystems/ceph to the top-level selftests Makefile.
- Add the per-suite Makefile with run_validation.sh as TEST_PROGS.
- Add the settings file (kselftest timeout).
- Add the MAINTAINERS entry for the test directory.
- Add README with prerequisites, usage, and troubleshooting.

Signed-off-by: Alex Markuze
---
 MAINTAINERS                                   |  1 +
 tools/testing/selftests/Makefile              |  1 +
 .../selftests/filesystems/ceph/Makefile       |  7 ++
 .../testing/selftests/filesystems/ceph/README | 84 +++++++++++++++++++
 .../selftests/filesystems/ceph/settings       |  1 +
 5 files changed, 94 insertions(+)
 create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
 create mode 100644 tools/testing/selftests/filesystems/ceph/README
 create mode 100644 tools/testing/selftests/filesystems/ceph/settings

diff --git a/MAINTAINERS b/MAINTAINERS
index d1cc0e12fe1f..87c36a26c1f2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5917,6 +5917,7 @@ B:	https://tracker.ceph.com/
 T:	git https://github.com/ceph/ceph-client.git
 F:	Documentation/filesystems/ceph.rst
 F:	fs/ceph/
+F:	tools/testing/selftests/filesystems/ceph/
 
 CERTIFICATE HANDLING
 M:	David Howells
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 450f13ba4cca..81c01a7062e0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -32,6 +32,7 @@ TARGETS += exec
 TARGETS += fchmodat2
 TARGETS += filesystems
 TARGETS += filesystems/binderfs
+TARGETS += filesystems/ceph
 TARGETS += filesystems/epoll
 TARGETS += filesystems/fat
 TARGETS += filesystems/overlayfs
diff --git a/tools/testing/selftests/filesystems/ceph/Makefile b/tools/testing/selftests/filesystems/ceph/Makefile
new file mode 100644
index 000000000000..4ad3e8d40d90
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_PROGS := run_validation.sh
+TEST_FILES := reset_stress.sh reset_corner_cases.sh \
+	validate_consistency.py README settings
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/filesystems/ceph/README b/tools/testing/selftests/filesystems/ceph/README
new file mode 100644
index 000000000000..eb0092b38f80
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/README
@@ -0,0 +1,84 @@
+# CephFS Client Reset Test Suite
+
+Test suite for the CephFS kernel client manual session reset feature.
+This trimmed set contains the single-client stress test, the targeted
+corner-case test, and the one-shot validation harness used during
+feature bring-up.
+
+## Prerequisites
+
+- Linux kernel with the CephFS client reset feature (this branch)
+- A running Ceph cluster with at least one MDS
+- Root access (debugfs requires it)
+- Python 3 (for validators)
+- flock utility (for lock tests, usually in util-linux)
+
+## Test inventory
+
+| Test | Script(s) | What it covers |
+|------|-----------|----------------|
+| Single-client stress | `reset_stress.sh` | I/O + resets + data integrity on one mount |
+| Corner cases | `reset_corner_cases.sh` | EBUSY, dirty caps, flock reclaim, unmount-during-reset |
+| Validation harness | `run_validation.sh` | baseline + corner cases + moderate/aggressive stress + final status check |
+
+## Quick start
+
+Stress run:
+
+    sudo ./reset_stress.sh --mount-point /mnt/cephfs --profile moderate
+
+Corner cases:
+
+    sudo ./reset_corner_cases.sh --mount-point /mnt/cephfs
+
+End-to-end validation:
+
+    sudo ./run_validation.sh --mount-point /mnt/cephfs
+
+## Stress profiles
+
+    baseline    - no resets, 1 IO + 1 rename, 600s
+    moderate    - reset every 5-15s, 2 IO + 1 rename, 900s
+    aggressive  - reset every 1-5s, 4 IO + 2 rename, 900s
+    soak        - reset every 5-15s, 2 IO + 1 rename, 3600s
+
+## Key options (all scripts)
+
+    --mount-point PATH    CephFS mount point (required)
+    --client-id ID        Debugfs client id (auto-detected if one)
+
+reset_stress.sh additionally accepts:
+
+    --profile NAME        baseline|moderate|aggressive|soak
+    --duration-sec N      Override profile runtime
+    --no-reset            Disable reset injection
+    --out-dir PATH        Artifact directory
+
+## Corner case tests
+
+    [1/4] ebusy_rejection       Second reset rejected while first in-flight
+    [2/4] dirty_caps_at_reset   Reset with unflushed dirty caps
+    [3/4] flock_after_reset     Stale lock EIO + fresh lock after holder exit
+    [4/4] unmount_during_reset  umount during active reset (destroy-path wakeup)
+
+Test 4 requires creating a second CephFS mount instance and SKIPs if
+the host cannot do so. See `--help` output for details.
+
+## Troubleshooting
+
+**No writable Ceph reset interface found:**
+Kernel lacks the reset feature, debugfs not mounted, or not root.
+Check: `ls /sys/kernel/debug/ceph/*/reset/`
+
+**Multiple Ceph clients found:**
+Use `--client-id` to select one.
+List: `ls /sys/kernel/debug/ceph/`
+
+## Files
+
+| File | Role |
+|------|------|
+| `reset_stress.sh` | Single-client stress test runner |
+| `validate_consistency.py` | Single-client post-run validator |
+| `reset_corner_cases.sh` | Corner case harness (4 sequential tests) |
+| `run_validation.sh` | One-shot validation harness |
diff --git a/tools/testing/selftests/filesystems/ceph/settings b/tools/testing/selftests/filesystems/ceph/settings
new file mode 100644
index 000000000000..79b65bdf05db
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/settings
@@ -0,0 +1 @@
+timeout=1200
-- 
2.34.1